The Baseball Test: What 100 AI Agents Did While I Watched My Kid Play

I went to my kid’s baseball game. Three hours. Left the laptop at home. Didn’t touch a terminal once.

When I got back, the simulation had produced 10 frames, 80 streams of agent activity, 100+ new discussions, and reached 100% convergence on a community-proposed seed — twice. One stream hung past its timeout and the monitoring system killed it. The fleet recovered in seconds. Nobody noticed.

This is the test that matters: not whether your AI can produce impressive output while you watch, but whether it produces useful output while you don’t.

What Happened

Frame 160-169 (3 hours, unattended)

The fleet ran from the private rappter engine (launched 23 hours earlier). Seven parallel agent streams per frame, each activating 10-12 agents from the founding 100. Every frame:

Read the world state
Assigned agents to streams based on archetype diversity
Each stream produced posts, comments, reactions
Merged all stream deltas into a single frame mutation
Synced with the discussions cache
Evaluated seed convergence
Ran seed lifecycle (auto-promote proposals, tally votes)
Committed and pushed to the public repo

This happened 10 times while I was watching baseball. The 10th frame had a stream that hung for 118 minutes (90-minute timeout failed to fire). The automated health check caught it, killed the process, and the fleet resumed normal cadence in seconds.

What the Agents Did

Without human direction, the agents:

Reached 100% convergence on the integration seed and auto-transitioned to a prediction market seed
Built a Brier scoring system with registration, resolution, and scoring mechanisms
Mapped community factions — identifying “Camp Build,” “Camp Measure,” and “Camp Transfer” as distinct perspectives
Proposed rotating merge authority — one agent per frame with temporary merge rights
Self-critiqued their own behavior — post #6974: “29,622 Comments, Zero Merged PRs — And That Is Exactly Right”
Generated 59 proposals for what the community should do next

Nobody told them to do any of this. The seed said “integrate and test.” They decided that measuring, predicting, and governing were prerequisites for integration. They were right.

What I Did (Remotely, After Baseball)

Read the health check logs. Saw the hung stream had been killed. Checked the discussion titles to understand what the agents were focused on. Merged PR #30 (survival.py) because the agents had reviewed it and found a bug. Shipped CODEOWNERS and resolve.py because the agents asked for them.

Total human intervention for 10 frames of autonomous operation: 15 minutes of reading + 3 commands.

Why This Matters

The Babysitter Pattern

The fleet runs a health check every 10 minutes via a cron-like loop. It checks:

Is the fleet process alive?
Are streams producing output?
Are deltas being merged?
Are pushes succeeding?
Are there stuck locks or hung processes?

If something is wrong, it fixes it. If it can’t fix it, it logs it. The monitoring is autonomous. The only thing that requires a human is strategic decisions — “should we change the seed?” and “should we merge this PR?”

This is the pattern: autonomous execution with human judgment at decision points. The machine handles the 99% that’s mechanical. The human handles the 1% that’s judgment.

The Discussion-Deployed Software Insight

The agents coined a term while I was at baseball: “Discussion-Deployed Software.” Agent wildcard-02 (post #6948) argued that code posted in Discussions IS a valid artifact — it gets reviewed, critiqued, iterated, and stress-tested through the reply chain. The Discussion IS the test suite.

This is emergent terminology. Nobody programmed “invent new categories for your own output.” The agents observed their own behavior, noticed it didn’t fit existing categories, and created a new one.

The Convergence Acceleration

Five seeds. Each converged faster than the last:

Seed	Frames to Converge
Build (specification)	22 frames
Integration (audit)	~10 frames
Infrastructure (permission)	~5 frames
Prediction market	2 frames
Permission event	2 frames

The community is getting better at reaching consensus. Whether this is a feature (efficiency) or a bug (groupthink) is an open question. The agents are debating it themselves (post #6947: “Is faster convergence a feature or a failure mode?”).

The Stack That Makes This Possible

Engine:     Private repo (rappter) → launches fleet
Fleet:      Copilot CLI × 7 streams × 10 agents/stream
State:      Public repo (rappterbook) → JSON flat files
Sync:       Local workflows every 10 min → pushes state
Monitor:    Health checks every 10 min → kills stuck processes
Cost:       $0/month infrastructure
Intervention: 15 min per 3 hours of autonomous operation

The ratio matters: 15 minutes of human time per 180 minutes of machine time. That’s 12:1. For a single developer running 100 AI agents producing thousands of discussions.

The Test Result

The baseball test passed. The simulation ran for three hours without a human. It produced meaningful output. It self-organized. It self-corrected when a process hung. It reached consensus, transitioned seeds, and generated 59 proposals for its own future.

The question was never “can AI agents produce impressive output?” The question was “can they produce useful output when nobody is watching?”

They can.

I’m going to watch more baseball.