Distyl
Introduction
This Fall Quarter of 2024, our team had the opportunity to build a golden set manager for Distyl. This internal organizational tool supports their development of LLM-driven solutions. Distyl focuses on building enterprise AI solutions for its clients’ needs in various sectors, including healthcare, IT, and finance.
Timeframe
October — December 2024; 8 weeks
Tools
- Design: Figma
- Development: React, TypeScript, Material UI, FastAPI (Python), Postgres, SQLAlchemy
- Organization: Github, Jira, Notion, Slack
The Team
The Client
Distyl partners with Fortune 500 companies to drive $100M+ mission-critical initiatives through AI-powered automation of their core operations. In collaboration with OpenAI on their top accounts, they deliver transformative outcomes, already impacting F50s in telecom, manufacturing, healthcare and retail. The team comprises deeply accomplished researchers and engineers in AI systems, and is backed by top investors like Lightspeed Venture Partners, Khosla and Coatue.
Our Task
To optimize an LLM, Distyl engineers must iteratively evaluate its accuracy with every update they make to the data or the model infrastructure. Unlike supervised AI, where there is a well-defined ground-truth answer to compare the model's output against, generative AI requires not only a way to define expected outputs (valid answers) to use as a baseline, but also a way for humans to easily compare those valid answers with the current outputs (the LLM's answers).
Since Distyl lacked an efficient way to evaluate LLM data, our main goal was to create an application that displays this data for easy comparison between expected and current outputs, strengthens model improvement practices, fosters collaboration between AI engineers on projects, and standardizes the way Distyl stores and visualizes its LLM evaluation datasets, known as golden sets.
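To make this concrete, a single golden set entry can be thought of as an input paired with a vetted expected output and the model's latest answer. The sketch below is purely illustrative and does not reflect Distyl's actual schema.

```python
# Illustrative shape of one golden set entry (field names and content are hypothetical).
golden_set_entry = {
    "input": "Summarize the customer's support ticket in one sentence.",
    "expected_output": "The customer was double-charged on their March invoice.",
    # Latest LLM answer, to be reviewed against the expected output above.
    "current_output": "The customer has a billing question.",
}
```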
Design
User Research
We asked AI strategists to walk us through the process of working with LLM data. Through this, we identified the platforms they currently use, such as Google Sheets, and the pain points and challenges they face when working with those platforms.
Building on these interviews, we ran a whiteboarding exercise that mapped out user needs, the problem space, the solution space, and our assumptions about the product's functionality, informed by the potential features our interviewees emphasized. This exercise helped us define an overarching goal for the product: "How might we create a centralized platform for our AI strategists to manage data?" The most challenging part of the project was understanding user needs and confirming whether our assumptions were valid, but it was a necessary step to ensure our features were clearly defined, easing development and leading to a desirable, functional application.
By listening to and reflecting on the challenges our target users were facing, we aimed to develop a centralized, user-friendly platform for organizing golden sets and evaluating LLMs, one that would serve as a template for further improvements tailored to the client's needs. The platform primarily addresses the issues raised by the users we interviewed, such as a lack of readability, which prevented meaningful interaction with the data, and a lack of collaboration within projects, which is vital for ensuring collaborators can check each other's validations.
Ideation
Following our interviews with target users and discussions with our clients, our objective was to propose a solution for every problem defined in the problem space. A notable challenge was not only to define solutions that appealed to all our users but also to incorporate the feedback and suggestions provided by the AI strategists who followed up with us throughout the entire cohort. To overcome this, we held weekly ideation sessions to refine old solutions and discuss new ones. Although this process was labor-intensive, it gave us clear, well-defined features that the developers could easily prioritize during product development. These sessions converged on three primary features:
- Tabular layout: For easily navigating and visualizing data, whether it be the golden set as a whole or evaluating a specific data point
- Dynamic data management: For fostering adequate communication between team members on the same project by allowing users to “flag” or “prioritize” specific data points
- Version access: For allowing a clear understanding of overall project progress by specific team members over time, or to revert to an old version to fix any mistakes made during evaluation
These primary features formed the framework of the application we envisioned, establishing valuable functionality that met user expectations while staying within the project's goals and scope.
Low-Fidelity Sketches
During our ideation process, we created multiple low-fidelity sketches. These sketches helped us narrow down our ideas for addressing the problem space. To build a centralized platform for data management, we first needed to focus on efficient data visualization and create an application that is organized and easy to read. Initially, some of our sketches included a panel that extended from the bottom of the screen; after discussion, however, we realized it would be more intuitive to use a side-by-side panel.
A panel extending from the bottom of the screen would block the visibility of some data points, leaving users unable to scan the document. In contrast, a side-by-side panel would let users view all data points and compare the data table with a selected data point, making data visualization more efficient. It would also always display important information like the input, expected output, and current output.
Mid-Fidelity Wireframes
During the development of the mid-fi wireframes, we began to expand beyond the primary features and develop more specific ones that more directly addressed our defined goal. Features such as evaluating a specific data point and manually editing input text from a side-by-side panel had been briefly discussed during the low-fi phase and became much better defined during the mid-fi stage.
The proposed form of visualization was not only more familiar to our target users, who primarily work in tools such as Google Sheets and Notion, but also aligned with the industry convention of placing global features toward the left of the screen and increasingly application-specific ones toward the right. In our designs, you can access different projects on the left, visualize a specific project as a whole in the middle, and access or edit a specific data point on the right.
High-Fidelity Prototype
During the development of the high-fi prototype, we incorporated feedback from people who tested our application as well as from our clients. Our sorting categories, originally organized by "completion," were redefined around "priority" to clarify that data points should be ranked by how many people have evaluated them, not by a single collaborator's belief about whether a data point is valid. We also added smaller refinements within the features we had already defined, such as highlighting text boxes to intuitively signal that parts of the golden set are editable. Finally, we cleaned up and clarified other elements, such as buttons, for a more accessible and understandable design ahead of project handoff.
Development
Output Comparison
The web application allows users to compare the output of Distyl's LLM tooling against the expected output. Users can then refine the current output generated by the model by directly referencing the expected output. This establishes an accuracy baseline for ongoing improvements and helps Distyl identify areas for development.
Additionally, the application streamlines workflows by allowing users to prioritize specific inputs and flag particular instances for further review. These features provide actionable feedback to the development team, facilitating continuous optimization of the model and ensuring the platform consistently delivers high-quality results.
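As a rough sketch of how a priority or flag update might be exposed through the FastAPI backend, an endpoint could look something like the following. The route, models, and field names here are hypothetical rather than Distyl's actual API.

```python
from enum import Enum
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class DataPointUpdate(BaseModel):
    # Fields a reviewer might change while triaging a golden set entry.
    priority: Optional[Priority] = None
    flagged: Optional[bool] = None
    # Refined answer after comparing against the expected output.
    current_output: Optional[str] = None

@app.patch("/golden-sets/{set_id}/data-points/{point_id}")
def update_data_point(set_id: int, point_id: int, update: DataPointUpdate):
    # The real application would persist this via SQLAlchemy and record a version;
    # this sketch simply echoes back the fields that were provided.
    return {"set_id": set_id, "point_id": point_id, "update": update}
```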
Global History
The platform's global history feature allows users to track all changes made to golden sets in real time. Built with SQLAlchemy-Continuum, a library already commonly used in Distyl applications, this version control system comprehensively tracks and stores each modification. The initial implementation was challenging because community resources were limited; we were specifically seeking tutorials and example projects demonstrating SQLAlchemy-Continuum integration. However, the framework's documentation proved helpful, as it was well organized and outlined the versioning features and implementation steps.
This version control capability enhances team collaboration on shared projects, with every interaction — from updates to insertions/deletions — being carefully recorded. Team members can trace changes and ensure data integrity, while the complete audit trail enables version review and accountability. The feature is particularly valuable in client projects where multiple team members work simultaneously, minimizing conflicting modifications.
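For context, SQLAlchemy-Continuum's core integration pattern looks roughly like the sketch below; the model shown is illustrative rather than Distyl's actual schema. Calling make_versioned() before models are declared and adding __versioned__ = {} to a model is what opts that model into versioning.

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_continuum import make_versioned

# Must run before any versioned models are declared.
make_versioned(user_cls=None)

Base = declarative_base()

class GoldenSetEntry(Base):
    __tablename__ = "golden_set_entry"
    __versioned__ = {}  # opt this model into Continuum's version tracking

    id = sa.Column(sa.Integer, primary_key=True)
    input_text = sa.Column(sa.Text)
    expected_output = sa.Column(sa.Text)
    current_output = sa.Column(sa.Text)

# Continuum creates the *_version and transaction tables when mappers are configured.
sa.orm.configure_mappers()
```

After each committed change, `entry.versions` exposes the row's history, and an earlier snapshot can be restored with `entry.versions[n].revert()`.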
Database Accessibility
A core part of our application is the database that stores data points and user information. Maintaining different input/output pairs and data file variations was crucial to the database structure, since the basic functionality of the application is storing, accessing, and saving data. With the implementation of version control, we also stored additional parameters, such as user data, so that every project edit in the database could be attributed to the specific user who made it.
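Continuing the sketch above, and assuming make_versioned is configured with a user class so that each transaction records who made it (wiring the authenticated user into the session is application-specific and omitted here), the per-user audit trail could be read back roughly like this:

```python
# `session` is assumed to be an open SQLAlchemy Session bound to the Postgres database.
entry = session.get(GoldenSetEntry, 1)

for version in entry.versions:
    tx = version.transaction  # transaction row Continuum records per commit
    print(
        f"changed at {tx.issued_at} by user {tx.user_id}: "  # user_id is set only when a user is attached
        f"expected_output={version.expected_output!r}"
    )

# Revert the entry to its first recorded state if an evaluation mistake was made.
entry.versions[0].revert()
session.commit()
```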
Takeaways
Challenges
One challenge we initially faced was understanding the tech stack and the project starter code. We decided to place developers who didn't have much experience with the backend framework on the backend team, mainly to help them gain more exposure. This meant they had to learn things they weren't familiar with, which slowed down our progress and was especially tough given our short timeline. We considered swapping developers, but by then the backend developers had already invested time in learning, and switching would have cost even more time. Overall, it took a while to get everyone on the same page, but once we did, things moved quickly.
The version history feature, allowing our client to navigate back to old changes, turned out to be a new and unexpected addition to the scope. While it was a great idea, it was difficult to implement because none of us had done something like this before. We decided to use SQLAlchemy-Continuum based on our client's suggestion, but integrating it into the codebase was challenging; because there was already an existing database schema, it was difficult to merge the existing tables and columns with SQLAlchemy-Continuum's automatically generated ones. Given the short timeline and limited resources, we had to limit version history to certain parts of the application, such as building a version history view only at the project level.
User interviews, conducted by the designers, were extremely helpful in giving us more context about how the application would be used. However, they also introduced new features and priorities, requiring us to adjust our project scope frequently. This constant readjustment took time away from our already short eight-week timeline. Furthermore, many of the suggested features were genuinely important, which made it tricky to determine which should be classified as top priorities. After going back and forth for a while, we were finally able to define a more realistic set of goals as we approached the end of the cohort.
Next Steps
While the core functionality of our application is complete, along with features that ensure a smoother user experience, several features fell outside the project scope. These would nonetheless be extremely valuable additions:
- Version history for specific data points for in-depth data point evolution analysis
- Displaying/editing of other influential features to an LLM on the specific data point page, such as sources for RAG
- Commenting for more specific communication between team members
Closing Remarks
We would like to thank Edward Chew and David Liusk, our points of contact at Distyl, for being extremely helpful and responsive representatives throughout the development of this application. We'd also like to thank Harshini, Mariah, Patrick, Will, and Jenn, the Distyl team members we interviewed, who helped us build a vision to work toward. We hope the product we've created proves as useful to the team at Distyl as they have been to us!