Awesome Agent Evals: 443+ Curated Links, 146 Deep Notes — A Real Awesome List
Not a Link Dump — A Truly Curated Agent Eval Resource Library
How many "awesome" lists have you seen on GitHub? Most are just link dumps where half the links are dead and the other half come with no explanation of why they're there.
The [awesome-evals](https://github.com/benchflow-ai/awesome-evals) list by BenchFlow is different. They claim it's a "non-BS" awesome list, and honestly, it lives up to the hype.
What Makes It "Non-BS"?
- **Depth-4 recursive citation crawl**: They crawled 11.6k papers and ranked them by in-degree (how many of the other crawled papers cite each one) to surface the academic canon
- **Targeted practitioner-web discovery**: They specifically hunted for industry sources that citation graphs miss (posts by Eugene Yan, Han-Chung Lee, Hamel Husain, etc.)
- **47 talks & podcasts**: All transcribed and deeply annotated (verbatim + timestamps)
- **Per-section gap audits**: With adversarial verification to ensure nothing important was missed
The result: **443+ curated links and 146 deep reading notes**. Every entry explains what it is and why it belongs here. URLs are checked, and dead or abandoned tools are pruned.
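To make the citation-crawl idea concrete, here is a minimal sketch of a depth-limited crawl followed by an in-degree ranking. It is not BenchFlow's actual pipeline: the `references` mapping is a hypothetical in-memory stand-in for whatever citation source they queried, and the function name and toy paper IDs are made up for illustration.

```python
from collections import Counter, deque


def crawl_and_rank(seeds, references, max_depth=4):
    """Breadth-first crawl of reference lists, then rank papers by in-degree.

    `references` maps a paper ID to the IDs it cites. This is an assumed,
    in-memory substitute for a real citation API.
    """
    depth = {seed: 0 for seed in seeds}  # crawl depth at which each paper was found
    in_degree = Counter()                # citations received from expanded papers
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        if depth[paper] >= max_depth:
            continue  # keep the paper, but do not follow its references
        for cited in set(references.get(paper, [])):
            in_degree[cited] += 1
            if cited not in depth:
                depth[cited] = depth[paper] + 1
                queue.append(cited)
    # Most-cited first; ties broken by ID so the order is deterministic.
    return sorted(depth, key=lambda p: (-in_degree[p], p))


# Toy citation graph (hypothetical paper IDs).
references = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}
ranking = crawl_and_rank(["A"], references)
print(ranking)
```

In this toy graph, `C` and `D` are each cited twice, so they rank ahead of `B` and the seed `A`. The same logic is also why a citation graph alone under-represents blog posts and talks: they are rarely cited by papers, which is the gap the practitioner-web discovery step targets.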
The Real Treasure: The Patterns Playbook
A resource list alone isn't enough — they also created a [PATTERNS.md](https://github.com/benchflow-ai/awesome-evals/blob/main/PATTERNS.md) playbook with **runnable code and worked examples** covering:
- LLM-as-judge (how to align it with human judgments)
- pass@k and pass^k metrics (success in at least one of k attempts vs. success in all k attempts)
- Error analysis
- Trajectory and world-state grading
- CI gating
- Verifiable rewards
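As a rough illustration of the pass@k and pass^k item above, the sketch below uses the combinatorial estimators that are common in the literature. It is not taken from PATTERNS.md; the function names and the assumption that each task was run `n` times with `c` successes are mine.

```python
from math import comb


def _check(n, c, k):
    # Guard against inputs that make the estimators meaningless.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n, c, k):
    """Estimate P(at least one of k attempts passes), given c passes in n trials."""
    _check(n, c, k)
    # 1 minus the chance that k attempts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k attempts pass), given c passes in n trials."""
    _check(n, c, k)
    # Chance that k attempts drawn without replacement all succeed.
    return comb(c, k) / comb(n, k)


# Example: an agent that solved a task in 7 of 10 trials.
capability = pass_at_k(10, 7, 3)    # 1 - 1/120
reliability = pass_hat_k(10, 7, 3)  # 35/120
print(f"pass@3 = {capability:.3f}, pass^3 = {reliability:.3f}")
```

The same 70% per-trial success rate gives a pass@3 near 0.99 but a pass^3 near 0.29. That gap is why the two metrics answer different questions: pass@k asks whether the agent can solve the task at all, while pass^k asks whether it does so consistently.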
Why You Should Care
AI agents are exploding right now, but let's be honest: **most teams have no idea how to evaluate whether their agent is actually good**. No evals, no improvement. That's an iron law of ML.
This project brings together the best evaluation resources from academia and industry, plus runnable code. Whether you're building an agent product or researching agent capabilities, this is a must-bookmark resource.
⚠️ Heads up: The project is brand new and evolving rapidly. In the list, 🆕 marks resources released or updated in 2025-2026.
---
**Repo**: https://github.com/benchflow-ai/awesome-evals
**Maintained by**: BenchFlow (tagline: "Environments are the new data")