Agent Evaluation System
Overview
The Rox Agent Evaluation System equips developers with a structured approach to benchmark, monitor, and improve each aspect of the Rox Agent, enabling the development of reliable AI systems.
So far, we’ve built out and used this evaluation system for a variety of pieces of our Agent Swarm, including our insights, content generation, and more!
Motivation
There are a couple of factors that inspired the creation of our evaluation system.
To begin, as we iteratively refined parts of our Rox Agent Systems (e.g. prompts, models, etc.), we realized we lacked a way to measure the impact of our changes without simply watching them in production, a problem that directly led to longer iteration cycles and more fires.
Additionally, multi-step workflows make it difficult to modularly evaluate which components of a workflow require improvement or are affected by changes.
Lastly, when developing these multi-step workflows with third-party dependencies, it was important for us to be able to monitor performance in real time to ensure stability for customers.
System Design
Our V0 evaluation system provides much of the core infrastructure necessary for our immediate needs and for the directions we see as natural extensions (see more below).
In the short term, we desired:
The ability to perform real-time and retrospective evaluation.
The ability to create labeled datasets for our modules.
The ability to generalize evaluation to any module without developer overhead.
We first built infrastructure to automatically log and version input/output to modules and LLM calls. This versioning is necessary because if prompts change to include new parameters, older data points become outdated and can’t be reused for evaluation. We also developed an internal labeling interface, allowing developers to label specific module responses, which serve as ground truth values during evaluation.
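To make the logging-and-versioning idea concrete, here is a minimal sketch assuming a decorator-based design; ModuleCall, log_module_io, and the JSONL sink are illustrative names, not Rox's actual infrastructure.

```python
# A minimal sketch of the logging-and-versioning idea, assuming a decorator-based design.
# ModuleCall, log_module_io, and the JSONL sink are illustrative names, not Rox's actual API.
import functools
import hashlib
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class ModuleCall:
    module: str
    prompt_version: str  # changes whenever the prompt template (and its parameters) change
    inputs: dict
    output: str
    timestamp: float


def prompt_version(template: str) -> str:
    """Derive a stable version id from the prompt template text."""
    return hashlib.sha256(template.encode()).hexdigest()[:8]


def log_module_io(module_name: str, template: str):
    """Decorator that records each call's inputs and output alongside the prompt version."""
    version = prompt_version(template)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**inputs):
            output = fn(**inputs)
            record = ModuleCall(module_name, version, inputs, output, time.time())
            # In practice this would land in a datastore; a JSONL file keeps the sketch simple.
            with open("module_calls.jsonl", "a") as f:
                f.write(json.dumps(asdict(record)) + "\n")
            return output

        return wrapper

    return decorator


# Hypothetical usage: a module whose calls are now logged and versioned automatically.
INSIGHTS_PROMPT = "Summarize the latest account notes for {account_name}."


@log_module_io("insights", INSIGHTS_PROMPT)
def insights_module(account_name: str) -> str:
    return f"Summary for {account_name}"  # placeholder for the real LLM call
```

In a sketch like this, deriving the version id from the template itself means any prompt change automatically invalidates older datapoints, which is exactly the staleness problem described above.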
Then, for each module we create, developers can define an offline_evaluator and an online_evaluator. The two types of module output we focus on evaluating are:
outputs with a ground truth value (e.g. categorical, numerical, etc.).
outputs without a ground truth value (e.g. open-ended content generation).
The first case is simpler: developers can define accuracy metrics based on exact matches for their evaluator functions. For the second case, we use an LLM-as-a-judge with a developer-defined rubric of axes to score each response in the online setting; in the offline setting, we can both compute rubric scores and run pairwise comparisons between outputs from the current system and outputs from the previous system.
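As a rough illustration of the two evaluator shapes, here is a sketch assuming a registry-style API; register_evaluators, the rubric axes, and the llm_judge callable are hypothetical stand-ins rather than the real Rox interface.

```python
# An illustrative sketch of the two evaluator shapes, assuming a registry-style API.
# register_evaluators, the rubric axes, and the llm_judge callable are hypothetical
# stand-ins, not the actual Rox interface.
from typing import Callable

EVALUATORS: dict[str, dict[str, Callable]] = {}


def register_evaluators(module: str, offline_evaluator: Callable, online_evaluator: Callable) -> None:
    EVALUATORS[module] = {"offline": offline_evaluator, "online": online_evaluator}


# Case 1: outputs with a ground truth value -> exact-match accuracy against labeled data.
def category_offline_evaluator(predictions: list[str], labels: list[str]) -> float:
    matches = sum(pred == label for pred, label in zip(predictions, labels))
    return matches / len(labels)


# Case 2: open-ended outputs -> LLM-as-a-judge scoring against a developer-defined rubric.
RUBRIC = {
    "relevance": "Does the response address the user's request?",
    "factuality": "Are all claims supported by the provided context?",
    "tone": "Is the response written in an appropriate, professional tone?",
}


def judge_online_evaluator(response: str, context: str, llm_judge: Callable[[str], dict]) -> dict:
    """Score a single response along each rubric axis (1-5) using a judge model."""
    prompt = (
        "Score the response on each axis from 1 to 5 and return JSON.\n"
        f"Axes: {RUBRIC}\nContext: {context}\nResponse: {response}"
    )
    return llm_judge(prompt)  # e.g. {"relevance": 4, "factuality": 5, "tone": 4}


register_evaluators("insights", category_offline_evaluator, judge_online_evaluator)
```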
The result of this work to date is that developers can now evaluate the impact of their module changes locally, directly on historical production data, by viewing automatically generated pairwise comparisons. When active, the online evaluators also track real-time performance metrics, allowing us to identify any issues impacting the quality of specific agents.
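A hedged sketch of what such an offline pairwise pass over historical production data could look like; rerun_module, llm_judge_pair, and compatible_versions are assumed names, not the actual tooling.

```python
# A hedged sketch of an offline pairwise pass over historical production data.
# rerun_module, llm_judge_pair, and compatible_versions are assumed names, not real tooling.
import json


def pairwise_eval(log_path: str, rerun_module, llm_judge_pair, compatible_versions: set[str]) -> dict:
    """Replay logged inputs through the updated module and ask a judge which output wins."""
    wins = losses = ties = 0
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["prompt_version"] not in compatible_versions:
                continue  # skip older datapoints whose inputs no longer match the prompt's parameters
            new_output = rerun_module(**record["inputs"])
            verdict = llm_judge_pair(old=record["output"], new=new_output)  # "new", "old", or "tie"
            wins += verdict == "new"
            losses += verdict == "old"
            ties += verdict == "tie"
    return {"wins": wins, "losses": losses, "ties": ties}
```

Under these assumptions, a developer would run something like pairwise_eval("module_calls.jsonl", updated_insights_module, judge, {"a1b2c3d4"}) locally and inspect the win/loss breakdown before shipping a change.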
Looking Ahead
In designing a system with so many unknowns, there’s lots of room for future work we’re thinking about:
Calibration: Auto-evaluation systems need to be highly aligned with user (human) ratings to be trusted at scale. Improving this calibration based on collected data is an ongoing process (a minimal agreement check is sketched after this list).
Feedback Interfaces: Though the internal labeling interfaces we’ve built are a start, the natural extension to gather feedback is directly from the user. Designing UX that captures high-signal feedback with minimal friction is key to effective user-driven evaluation.
Self-Improving Systems: Most current LLM systems are stagnant — they’re built once, and despite the flywheel of data that’s available to them, they seldom see improvements. With the presence of feedback and data on a daily basis, we aim to implement dynamic adjustments that keep agents evolving alongside user needs.
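To illustrate the calibration point above, one way alignment between an LLM judge and human raters could be measured is a simple agreement check over responses that have both a judge score and a human label; this is a sketch of an assumed approach (using scikit-learn's cohen_kappa_score), not our implementation.

```python
# An assumed illustration of measuring judge/human calibration; requires scikit-learn.
from sklearn.metrics import cohen_kappa_score


def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare LLM-judge rubric scores against human labels on the same responses."""
    exact = sum(j == h for j, h in zip(judge_scores, human_scores)) / len(human_scores)
    kappa = cohen_kappa_score(judge_scores, human_scores, weights="quadratic")
    return {"exact_agreement": exact, "weighted_kappa": kappa}


# Example: 1-5 rubric scores from the judge vs. human raters on the same ten responses.
print(judge_human_agreement([4, 5, 3, 4, 2, 5, 4, 3, 5, 4],
                            [4, 4, 3, 5, 2, 5, 4, 3, 4, 4]))
```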
If any of these directions sound interesting to you, please get in touch: we’re hiring!