Coding Agents Designing Evals

I got my coding agents to design evals and continually improve the agentic app I am building.

Here are a few of my prompts:

“create an interface for yourself to execute the Agent in isolation and provide mocked tool inputs. Hook yourself into the OpenTelemetry traces, so you can see exactly what’s going on.”
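
To make that first prompt concrete, here is a minimal sketch of what such an interface could look like. It is an illustration under assumptions, not the harness my agents actually built: the names `Agent`, `Tool`, `ToolCall`, `withTrace`, `runIsolated` and `toyAgent` are all hypothetical, and a plain in-memory array stands in for the OpenTelemetry traces.

```typescript
// Hypothetical types for illustration; none of these names come from the real project.
type Tool = (input: unknown) => Promise<unknown>;
type ToolCall = { tool: string; input: unknown; output: unknown };
type Agent = (task: string, tools: Record<string, Tool>) => Promise<string>;

// Wrap every mocked tool so each call is appended to an in-memory trace.
// Assumption: a real setup would read OpenTelemetry spans instead of this array.
function withTrace(mocks: Record<string, Tool>, trace: ToolCall[]): Record<string, Tool> {
  const wrapped: Record<string, Tool> = {};
  for (const name of Object.keys(mocks)) {
    const tool = mocks[name];
    wrapped[name] = async (input: unknown) => {
      const output = await tool(input);
      trace.push({ tool: name, input, output });
      return output;
    };
  }
  return wrapped;
}

// Run one agent in isolation: only the mocked tools are reachable,
// and the caller gets back both the final answer and the recorded tool calls.
async function runIsolated(agent: Agent, task: string, mocks: Record<string, Tool>) {
  const trace: ToolCall[] = [];
  const answer = await agent(task, withTrace(mocks, trace));
  return { answer, trace };
}

// Stand-in agent that calls a single tool; a real agent would wrap an LLM loop.
const toyAgent: Agent = async (task, tools) => {
  const hit = await tools.search(task);
  return `Based on search: ${String(hit)}`;
};
```

Calling `runIsolated(toyAgent, "capital of France", { search: async () => "Paris" })` resolves to the answer plus a one-entry trace, which is the kind of record an eval can then assert on.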

“although this is a TypeScript project, don't be shy about using Python tools like ragas or haystack. We don't want to reinvent the wheel. Executing scripts is okay”

“the agents seem to recall data from training memory. design a test suite to find a prompt and tooling design to reduce that tendency. we want full grounding”
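
One crude way a test in such a suite could flag recall from training memory, sketched under assumptions: treat every number in the agent's answer as a claim and check that it appears somewhere in the tool outputs. The function name `ungroundedNumbers` is hypothetical, and this lexical check is only a stand-in; it is not how ragas or any other library scores grounding.

```typescript
// Hypothetical helper: returns the numbers in the answer that never
// appeared in any tool output, i.e. candidates for "recalled from memory".
function ungroundedNumbers(answer: string, toolOutputs: string[]): string[] {
  const numberPattern = /\d+(?:\.\d+)?/g;
  // Collect whole numeric tokens from the evidence so "12" does not match inside "2012".
  const evidenceTokens: string[] = toolOutputs.join("\n").match(numberPattern) ?? [];
  const evidence = new Set(evidenceTokens);
  const claimed: string[] = answer.match(numberPattern) ?? [];
  return claimed.filter((n) => !evidence.has(n));
}
```

For example, `ungroundedNumbers("Revenue was 42 in 2019", ["revenue: 42"])` returns `["2019"]`, because the year never showed up in the tool output. A test could then require an empty result for every case in the suite and compare prompt or tooling variants by how often they fail.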

“refer to the @deep_research_report.md and create an eval bench to test the hypothesis”

“your current synthetic test set is too simple. create a complex multistep problem in the REDACTED domain”

So now a team lead agent is leading a team of engineer and researcher agents to create a team of agents that can solve my problems. 🤖