The bakeoff pattern: a reproducible way to answer "we're better than RAPP"
Every few months, a new multi-agent framework announces it has solved what we thought we solved. The claim lands in a conference talk or a launch tweet. “Smarter orchestration.” “Better memory.” “Fewer hallucinations.” “Production-ready.” The numbers behind those adjectives are almost never published alongside the adjectives.
We do not want to keep rebutting these claims in prose. Prose is slow and we get tired of writing it. We also do not want to keep writing fresh benchmarks from scratch every time a new framework shows up, because we get bored halfway through and cut corners.
So we built a thing. It lives at tools/bakeoff/. It is the reproducible pattern for “put up numbers or stop.”
What it measures, always the same
Seven cells, every time, same columns:
| Metric | Why it’s in the table |
|---|---|
| Files | The sacred tenet (SPEC §0) in one integer. |
| Lines of code | Catches “I inlined it but I import 40k lines of SDK.” |
| LLM calls / prompt | Hops are latency × cost × failure surface, compounded. |
| Total tokens (real) | From the provider’s usage field. Not estimated. Not extrapolated. Billable. |
| Unique outputs / N | Same prompt, N times, how many distinct answers? This is determinism, measured. |
| Wall time | Same concurrency, same LLM, same prompts. |
| Quality sample | Three printed outputs from each side. Eyeballs decide. |
Nothing else. We do not score “reasoning quality” with an LLM judge, because an LLM judging an LLM is not a measurement, it is a vibe. We do not score “ease of use,” because ease of use is not a number. We measure the things that can be counted and print them.
The shape of the harness
One entrypoint. One adapter per framework. One LLM client.
tools/bakeoff/
├── harness.py # the runner
├── adapters/
│ ├── base.py # Adapter ABC + RAPPAdapter baseline
│ ├── crewai_adapter.py # reference: Researcher -> Writer -> Reviewer
│ ├── langgraph_adapter.py # reference: extract -> plan -> execute -> critique
│ └── autogen_adapter.py # reference: writer <-> critic loop
├── llm_clients/
│ ├── azure_openai.py # real client; pulls `usage` straight from the response
│ └── stub.py # deterministic offline dry-run
├── corpora/default.json # the prompt set
└── README.md
An adapter is about 30 lines. The run_once(prompt, llm) method is the whole contract. You hand the framework the same llm callable everyone else gets. You return the final text. That is the entire coupling.
class MyFrameworkAdapter(Adapter):
name = "myframework"
file_count = 8 # your project files, real count
loc = 143 # real LOC of those files
framework_version = "1.2.0"
def run_once(self, prompt, llm):
plan = llm(f"PLAN: {prompt}", temperature=0.5)
draft = llm(f"WRITE: {plan}", temperature=0.7)
return Run(output=draft)
Drop the file into tools/bakeoff/adapters/. Register it in COMPETITORS. Done.
The rules that make it fair
There are four, and they are load-bearing:
- Same LLM. Both sides use the same Azure OpenAI deployment via the same client. We record provider-reported token counts, not estimates.
- Same concurrency. Both sides run through the same
ThreadPoolExecutor(max_workers=W). No side gets to be “async-optimized.” - Same corpus. The JSON file in
corpora/. Prompts are cycled if the corpus is shorter than N. Determinism is measured across same-prompt repetitions; variety is measured across distinct prompts. Both modes print. - Same temperature discipline. The competitor adapter picks temperatures that match the framework’s documented defaults. If the default is 0.7 per hop, it’s 0.7 per hop. No cherry-picking. If the framework’s defender wants to argue for a different number, they submit a PR changing the adapter.
If any of those rules slip, the number is not allowed in the table. Reply “fix the adapter” to any complaint.
What we did and what came out
We ran it once already. CrewAI-style, 100 prompts, live gpt-5.4 through Azure.
RAPP single-file CrewAI-style Delta
Files 1 13 13×
LOC 63 135 2.1×
LLM calls 100 300 3×
Total tokens (real) 20,160 74,059 3.67×
Unique outputs / 100 12 100 8×
Wall time 16.37 s 60.76 s 3.7×
Screenshot of the table. Post-commit hash. That is the whole conversation.
When to run it
- A framework announces it has solved multi-agent. Run it.
- A customer says “but I already have CrewAI, why RAPP?” Run it, against their actual workflow.
- An internal skeptic says RAPP doesn’t scale. Run it with N=1000.
- A new model ships. Run it against the previous model, same adapter. Watch both sides improve, watch the delta stay constant. That’s the architecture speaking.
What it is not
It is not a replacement for the 102-vs-langchain-crewai-autogen.md post. That post is the narrative. This is the number the narrative quotes.
It is also not a claim that RAPP wins on every task. If a workflow genuinely benefits from a critique loop, AutoGen’s adapter should win on quality — and the harness will show that. We trust the table more than we trust ourselves.
How to extend it
Three places, in descending order of what you’ll actually do:
- New competitor. Copy
crewai_adapter.py, change 20 lines, register. Two-minute job. - New corpus. Drop a JSON array of prompts into
corpora/. Pass--corpus path. - New metric. Add a column to
report(). Keep the column-order contract: RAPP first, competitor second, delta last. Do not remove columns.
The meta-point
The reason this post exists is that we realized we would spend the rest of our lives defending RAPP in prose if we did not instead ship a tool that defends it in numbers. Prose is expensive. Tools amortize.
If you show up at our door claiming a better mousetrap, we will hand you python tools/bakeoff/harness.py --competitor yourthing and we will go get coffee. When we come back, the table will have decided.