Bake off your own stack — a migration assessment for enterprise teams
If your team is running CrewAI, LangGraph, AutoGen, or a homegrown multi-agent framework today, this post is your first 90 minutes of due diligence on RAPP. You don’t have to migrate. You don’t have to commit. You just run the harness.
The output is a numeric report you can take to your tech lead, your platform team, or your CFO.
What you’ll have at the end
A directory like this:
```
tools/bakeoff/run_artifacts/yourframework__20260419-141500/
├── summary.json                 # the seven-cell table
├── rapp_outputs.json            # what RAPP produced for your prompts
├── yourframework_outputs.json   # what your stack produced
└── diff_sample.txt              # first three distinct outputs from each side
```
summary.json answers: “If we migrated this workflow, what would change?”
The 90 minutes, broken down
| Step | Time | Output |
|---|---|---|
| 1. Clone RAPP, install deps | 10 min | working harness |
| 2. Write your adapter (<yourframework>_adapter.py) | 25 min | one file, ~30 lines |
| 3. Export your prompts to a JSON corpus | 10 min | corpora/yourstack.json |
| 4. Write the equivalent RAPP single-file agent | 20 min | one *_agent.py |
| 5. Run the harness, both sides, n=100 | 15 min | summary.json |
| 6. Read the table, write up findings | 10 min | the slide |
If you’ve never seen RAPP before, add 30 minutes to read SPEC.md first. It’s worth it.
Step 2 in detail — your adapter
The harness expects an adapter that exposes one method:
```python
from tools.bakeoff.adapters.base import Adapter, Run

class YourStackAdapter(Adapter):
    name = "yourstack"
    file_count = 8          # count YOUR project files (not the framework's lib)
    loc = 143               # wc -l them
    framework_version = "1.2.0"

    def run_once(self, prompt, llm):
        # Replace these with your actual workflow.
        # `llm(prompt, temperature=...)` is the harness-provided LLM.
        # Use the same default temperatures your framework uses in production.
        notes = llm(f"PLAN: {prompt}", temperature=0.7)
        draft = llm(f"WRITE: {notes}", temperature=0.7)
        return Run(output=draft)
```
That’s it. Drop it into tools/bakeoff/adapters/yourstack_adapter.py, register it in harness.py’s COMPETITORS dict, and run.
The adapter does not need to invoke your real framework. It needs to mimic the LLM call pattern your framework produces under load: same number of hops, same temperatures, same prompt scaffolding shape. The harness measures the wire, not the framework.
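Because `llm` is just a callable the harness passes in, you can smoke-test your adapter's call pattern offline before spending any tokens. A minimal sketch, assuming nothing about the harness beyond that callable — `FakeLLM` is our stand-in, not part of the harness:

```python
# Offline smoke test for the adapter's two-hop call pattern.
# FakeLLM is a stand-in for the harness-provided callable; the real one
# hits Azure OpenAI. This checks hop count and prompt scaffolding only.
class FakeLLM:
    def __init__(self):
        self.calls = []  # (prompt, temperature), in call order

    def __call__(self, prompt, temperature=0.0):
        self.calls.append((prompt, temperature))
        return f"<reply to: {prompt[:20]}>"

def run_once(prompt, llm):
    # Mirrors the adapter above: a plan hop, then a write hop.
    notes = llm(f"PLAN: {prompt}", temperature=0.7)
    draft = llm(f"WRITE: {notes}", temperature=0.7)
    return draft

llm = FakeLLM()
out = run_once("Summarize this RFP in three bullets", llm)
assert len(llm.calls) == 2                    # two hops, like production
assert llm.calls[0][0].startswith("PLAN: ")
assert llm.calls[1][0].startswith("WRITE: ")
```

If the hop count or scaffolding here doesn't match what your framework emits in production, the bakeoff numbers won't mean anything — fix the mimicry first.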
Step 3 in detail — your corpus
Pull 25 to 100 representative prompts from your production logs. Strip PII. Save as JSON:
```json
[
  "Summarize this RFP in three bullets: ...",
  "What are the risks in this contract: ...",
  "Generate a customer reply for: ..."
]
```
Drop in tools/bakeoff/corpora/yourstack.json. The harness will use it via --corpus.
If your prompts vary widely in length or shape, run the bakeoff on subsets too: --corpus contracts.json separately from --corpus replies.json. The deltas often differ by workload class.
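The export step is a few lines of Python. A sketch, where `raw_prompts` stands in for whatever your log query returns and the email regex is a deliberately minimal PII scrub — extend it for your data:

```python
import json
import re

# Hypothetical export step. raw_prompts is a placeholder for your real
# production log query; the regex strips emails only, not all PII.
raw_prompts = [
    "Summarize this RFP in three bullets: ... (sent by alice@example.com)",
    "What are the risks in this contract: ...",
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
corpus = [EMAIL.sub("<email>", p) for p in raw_prompts]

with open("yourstack.json", "w") as f:
    json.dump(corpus, f, indent=2)
```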
Step 4 in detail — your RAPP agent
This is where the real work is. Take your multi-agent workflow and ask: “what would this look like as a single *_agent.py?”
Walk-through:
- Identify the one final answer the user needs. Not the intermediate steps. The thing they actually consume.
- Identify the structured signals each intermediate step produces (entities, scores, classifications). These become data_slush.
- Write one perform() that produces the final answer, calling the LLM as few times as possible (often once), and emitting the slush signals as typed output.
The example in 119-writing-your-first-rapplication-today.md is a good template. The reference agents/summarizer_agent.py in the repo is a working minimal example.
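To make the shape concrete, here is a sketch of what the collapse looks like. This is illustrative only — RAPP's real base class and signatures live in the repo, and Result, ContractRiskAgent, and the data_slush keys below are our invented names, not RAPP's:

```python
from dataclasses import dataclass, field

# Shape sketch of a single-file agent: one perform(), one LLM call,
# final answer plus typed slush. Names here are illustrative; the
# repo's summarizer_agent.py is the authoritative reference.
@dataclass
class Result:
    answer: str                                     # the one thing the user consumes
    data_slush: dict = field(default_factory=dict)  # structured intermediate signals

class ContractRiskAgent:
    def perform(self, prompt, llm):
        # One call produces the final answer; the slush carries the
        # signals the old intermediate agents used to emit.
        answer = llm(f"List the contract risks as bullets: {prompt}")
        return Result(answer=answer,
                      data_slush={"bullet_count": answer.count("- ")})
```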
You’ll likely find the RAPP version is shorter than you expected. That’s not because RAPP is magic — it’s because your multi-agent workflow had ceremony that wasn’t doing useful work. The bakeoff makes the ceremony visible.
Step 5 in detail — the run
```bash
set -a; . RAPP/.env; set +a   # or export AZURE_OPENAI_* yourself
python tools/bakeoff/harness.py \
  --competitor yourstack \
  --corpus tools/bakeoff/corpora/yourstack.json \
  --rapp-agent path/to/your_rapp_agent.py \
  --rapp-class YourRappAgent \
  --n 100 --workers 12
```
Cost: roughly $0.50 to $5.00 in Azure OpenAI charges, depending on prompt size and your model tier. Time: 1 to 5 minutes.
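If you want a tighter estimate before you commit the budget, the arithmetic is simple. Every number below is a placeholder — substitute your prompt sizes and your model tier's actual pricing:

```python
# Back-of-envelope run cost. All values are placeholders, not quotes.
n_prompts = 100
hops_per_prompt = 2            # LLM calls per prompt on your stack's side
tokens_per_call = 1_500        # prompt + completion, rough average
usd_per_1k_tokens = 0.002      # placeholder price for your tier

est_cost = n_prompts * hops_per_prompt * tokens_per_call / 1000 * usd_per_1k_tokens
print(f"estimated run cost: ~${est_cost:.2f}")
```

With these placeholders the estimate lands around $0.60, inside the range above; double it to cover the RAPP side of the run.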
Step 6 — the slide
The cleanest version:
```
Workload: <name>    N=100 prompts    Model: <gpt-5.4 or whatever>

                    Today    RAPP    Delta
Files               N        1       Nx
LLM calls/prompt    H        1       Hx
Tokens (real)       T1       T2      T1/T2
Unique outputs      U1       U2      U1/U2
Wall time           W1       W2      W1/W2
```
Fill the cells from summary.json. That’s the slide. Show it to your tech lead.
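Filling the cells is a one-screen script. The field names below are an assumed schema — map them to whatever keys your summary.json actually has. The data is inlined so the sketch runs standalone; in practice you'd json.load the file from the run_artifacts directory:

```python
import json

# Assumed summary.json schema -- adjust key names to the real file.
summary = json.loads("""{
  "yourstack": {"files": 8, "llm_calls": 2.0, "tokens": 310000,
                "unique_outputs": 97, "wall_s": 240.0},
  "rapp":      {"files": 1, "llm_calls": 1.0, "tokens": 120000,
                "unique_outputs": 41, "wall_s": 95.0}
}""")

rows = [("Files", "files"), ("LLM calls/prompt", "llm_calls"),
        ("Tokens (real)", "tokens"), ("Unique outputs", "unique_outputs"),
        ("Wall time (s)", "wall_s")]

print(f"{'':<18}{'Today':>10}{'RAPP':>10}{'Delta':>8}")
for label, key in rows:
    today, rapp = summary["yourstack"][key], summary["rapp"][key]
    print(f"{label:<18}{today:>10}{rapp:>10}{today / rapp:>7.1f}x")
```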
What the slide tells you
If the RAPP deltas on Files, Tokens, and Wall time are all > 1.5×, migrating that workload to RAPP would measurably improve cost and latency.
If the delta on Unique outputs is large (RAPP smaller, framework larger), migrating that workload would measurably improve determinism, which is what enables caching, testing, and reliable downstream automation.
If all the deltas are near 1.0, you’re already shipping a near-RAPP architecture and you might not need to migrate. Consider this an audit pass.
If RAPP loses on any dimension, investigate why. The harness’s diff_sample.txt shows the actual outputs side by side. Often the issue is your RAPP agent’s soul is under-specified, not that RAPP is wrong for the workload.
The migration question
The bakeoff doesn’t tell you to migrate. It tells you what migrating would measure as. Whether the measurements justify the engineering cost is your call. Things to factor:
- One-time engineering cost. Rewrite N agents as M single-file agents. Usually M < N.
- Operational savings. Token bill, latency, support tickets caused by variance.
- Strategic optionality. RAPP agents run on three tiers unmodified. Your current agents probably don’t.
- Risk of doing nothing. Today’s framework is one major release away from breaking your code. RAPP v1 is frozen.
If the math says migrate, migrate one workflow. Run the bakeoff again at quarter end. Decide on the next one.
The honest version
We didn’t build the bakeoff to convert you. We built it because we got tired of arguing with people whose claims were unfalsifiable. The harness makes both your stack and ours falsifiable. Anyone can run it. Anyone can dispute it. The table settles the conversation.
If your stack wins the bakeoff, we want to know. We’ll learn from your adapter and probably steal an idea or three. If RAPP wins, we want you to know — but only because you measured it, not because we told you.
Run the bakeoff. Let the numbers do the talking.