Every AI training loop has the same problem: the agent gets good and then stops learning.
You train a model against a fixed environment. It masters the environment. The loss plateaus. The benchmark saturates. The agent is "done." But the real world keeps changing. The agent's skills calcify. It's optimized for a snapshot of reality that no longer exists.
The Rolling Snowball is a pattern that makes this impossible. The environment grows in fidelity forever. Every time an agent masters the current version, a new version drops with hazards the agent has never seen. The agent must evolve or die. There is no plateau. There is no "done."
MMLU, HumanEval, and ARC are fixed question sets. Once an agent scores 95%, the benchmark is "solved." But the world didn't get easier. The benchmark just stopped growing. The agent overfits to a static target.
Procedural generation helps: roguelikes generate random levels. But the types of challenges are fixed. A Minecraft world is infinite in size but finite in variety. Once you've seen every biome, you've seen everything. The agent learns the type space and plateaus.
AlphaZero-style self-play creates co-evolving opponents. But the rules don't change. Chess is always chess. The agent gets better at chess but not at handling new rules. When reality changes the rules (it always does), self-play agents break.
Curriculum learning gradually increases difficulty. But a curriculum requires a designer to define the progression. The snowball doesn't need a designer — the fidelity grows from real-world data. Each version adds hazards discovered from actual Mars missions. The curriculum IS reality catching up to the sim.
New versions add hazard types, event types, and challenge types. They NEVER remove or contradict existing frame data. Sol 47 in v1 still has the same temperature and dust level in v2 — but v2's Sol 47 ALSO has perchlorate corrosion data that v1 didn't measure.
This is the echo principle: enrich, never rewrite. Downstream frames remain coherent. An agent trained on v1 frames can still read v2 frames — it just encounters new fields it hasn't seen before. The surprise IS the learning signal.
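The additive rule is easy to state as code. A minimal sketch in Python, with hypothetical field names (`temp_c`, `dust_level`, `perchlorate_ppm`) and the `enrich` helper invented here to stand in for real frame data:

```python
# Sketch of the enrich-never-rewrite rule on hypothetical frame dicts.

def enrich(frame_v1: dict, new_fields: dict) -> dict:
    """Produce a v2 frame: every v1 field survives unchanged; only new fields appear."""
    overlap = set(frame_v1) & set(new_fields)
    if overlap:
        raise ValueError(f"rewrite attempted on existing fields: {overlap}")
    return {**frame_v1, **new_fields}

sol_47_v1 = {"sol": 47, "temp_c": -63.0, "dust_level": 0.4}
sol_47_v2 = enrich(sol_47_v1, {"perchlorate_ppm": 870})

# A v1-era agent can still read the v2 frame: every old field is unchanged.
assert all(sol_47_v2[k] == v for k, v in sol_47_v1.items())
# The new field is the surprise, and the surprise is the learning signal.
assert "perchlorate_ppm" in sol_47_v2
```

Trying to change an existing field (say, rewriting `sol_47_v1["dust_level"]`) raises instead of silently mutating history, which is the whole point of the rule.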
Each version's new hazards come from real NASA mission data, not imagination:
| Hazard | Source | Version Added |
|---|---|---|
| Dust storms | Viking 1&2 lander data | v1 |
| Solar panel degradation | MER (Spirit/Opportunity) post-mortem | v1 |
| Perchlorate corrosion | Phoenix lander soil chemistry | v2 |
| Regolith abrasion | Opportunity flash memory failure | v2 |
| Radiation bit flips | MSL/RAD instrument measurements | v2 |
| Battery cold cycling | MER battery performance data | v2 |
| Crew psychology | ISS + Mars analog missions | v3 (future) |
| Regolith toxicity | Perseverance MEDA/PIXL | v4 (future) |
As Mars missions produce more data, the sim absorbs it. The fidelity converges on reality. The snowball grows toward the real planet.
A strategy isn't scored on one run. It's scored on 100 independent runs with different RNG seeds (Constitutional Amendment IV). This separates strategy quality from luck. The median outcome across 100 runs IS the strategy's true quality.
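The scoring rule can be sketched in a few lines. Everything below except "median over 100 independently seeded runs" is invented for illustration: `simulate` is a toy stand-in for the real sim, and the uniform-noise survival model is arbitrary.

```python
import random
import statistics

def simulate(strategy_quality: float, seed: int) -> int:
    """Toy stand-in for one seeded rollout: returns the sol the colony survives to."""
    rng = random.Random(seed)
    # Luck perturbs any single run by up to 20% either way...
    return int(strategy_quality * 500 * rng.uniform(0.8, 1.2))

def official_score(strategy_quality: float, runs: int = 100) -> float:
    """...but the median of 100 seeded runs washes the luck out."""
    return statistics.median(simulate(strategy_quality, seed) for seed in range(runs))

strong = official_score(1.0)   # median lands near sol 500 regardless of seed luck
weak = official_score(0.5)     # median lands near sol 250
```

A single lucky run can't promote a weak strategy: even the luckiest 0.5-quality run here tops out below the unluckiest 1.0-quality run, so the medians never cross.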
When a new version drops, every strategy is re-evaluated across 100 runs. A strategy that was 95% survival on v1 might drop to 40% on v2. That delta IS the signal: the agent needs to evolve 55% worth of new capability.
The 100 runs go through ALL versions sequentially. v1 → v2 → v3 → ... with state carrying forward. Damage accumulates across versions. A strategy can't just handle v2 — it must handle v1 THEN v2 with the compound damage from v1 still present.
This is realistic. A real Mars colony doesn't get a fresh start when new challenges appear. The dust that degraded your panels in year 1 is still on them when the perchlorate starts eating your joints in year 2.
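The carry-forward can be sketched in a few lines. The per-sol damage rates and sol counts below are invented; the only point is that the same `hp` flows from one version into the next with no reset.

```python
# Sketch of the sequential gauntlet: versions play in order on the SAME state.

def run_gauntlet(versions, hp: float = 100.0) -> float:
    """Each version is (damage_per_sol, sols). No fresh start between versions."""
    for damage_per_sol, sols in versions:
        hp -= damage_per_sol * sols
        if hp <= 0:
            return 0.0          # the colony dies mid-gauntlet
    return hp                   # hp remaining after the final version

# v1 alone: mild dust wear, easily survivable.
v1_only = run_gauntlet([(0.05, 161)])
# v1 THEN v2: the dust wear from v1 is still on the books when v2's
# heavier compound damage arrives, and the colony doesn't make it.
compound = run_gauntlet([(0.05, 161), (0.30, 340)])
```

A strategy evaluated only on the v2 segment with fresh `hp` would look healthier than it really is; the gauntlet forbids that shortcut.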
When a strategy fails against a new version, the adaptation is explicit: write a new LisPy program. Not retrain weights. Not adjust hyperparameters. Write a specific tool that handles the specific new hazard.
```lisp
;; v1 strategy had no perchlorate handling.
;; v2 kills robots via joint corrosion.
;; Agent writes this tool to handle it:
(begin
  (define joints_at_risk (> perchlorate_exposure 0.1))
  (if joints_at_risk
      (begin
        ;; Reduce actuator cycling — fewer movements = less corrosion
        (set! patrol_frequency 0.5)
        ;; Schedule joint inspections every 50 sols
        (if (= (% sol 50) 0)
            (log "🔧 Joint inspection — perchlorate mitigation"))
        ;; Prioritize sealed actuators in build queue
        (if (and (> power_kwh 200) (not (module-built "sealed_actuators")))
            (set! next_build "sealed_actuators")))))
```
The program is inspectable (you can read what it does), shareable (export in a cartridge), evolvable (genetic programming can mutate it), and deployable (same VM on real hardware). The adaptation artifact is code, not weights.
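The "evolvable" claim can be made concrete with a toy mutation operator. The parser below handles only the tiny subset needed for this example and is not the real LisPy reader; the ±10% jitter on numeric constants is an arbitrary illustrative choice.

```python
import random

def parse(src: str):
    """Read one s-expression into nested Python lists (toy subset, not full LisPy)."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1
        tok = tokens[pos]
        try:
            return float(tok), pos + 1   # numeric leaf
        except ValueError:
            return tok, pos + 1          # symbol leaf
    return read(0)[0]

def mutate(node, rng):
    """Return a copy with every numeric leaf jittered by up to +/-10%."""
    if isinstance(node, list):
        return [mutate(child, rng) for child in node]
    if isinstance(node, float):
        return node * (1 + rng.uniform(-0.1, 0.1))
    return node

tree = parse("(set! patrol_frequency 0.5)")
child = mutate(tree, random.Random(1))   # e.g. patrol_frequency drifts within [0.45, 0.55]
```

Because the artifact is a tree of symbols and numbers, a genetic-programming loop can mutate it, re-score each child through the gauntlet, and keep the survivors; none of that is possible when the artifact is a weight matrix.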
We built this. Here's what happened:
v1 (501 frames, base hazards):
- Best strategy: 4 robots, repair bay, 3 solar farms.
- Monte Carlo: 100% survival through Sol 161. Immortal.

v2 (501 frames, robot-killers added at Sol 162):
- Same strategy: 0% survival by Sol 501. Every run dies Sol 300-324.
- HP degrades from 100 → 0 over 150 sols of compound perchlorate + abrasion + radiation.
The delta: 100% → 0% survival. The snowball broke the immortal strategy. The agent must now write tools for 6 new hazard types it has never encountered.
Score went from "leaderboard alive" to "NON-VIABLE" in one version bump.
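A back-of-envelope check ties these numbers together. Only the totals come from the run above; the arithmetic below just confirms they are mutually consistent.

```python
# Hazards switch on at Sol 162 and drain 100 HP over 150 sols.
HP_START = 100.0
HAZARD_ONSET_SOL = 162      # v2's robot-killers begin here
DEGRADATION_SOLS = 150      # "HP degrades from 100 -> 0 over 150 sols"

# Combined perchlorate + abrasion + radiation drain, in HP per sol.
rate = HP_START / DEGRADATION_SOLS          # = 2/3, about 0.67 HP/sol

# Expected death sol, which lands inside the observed Sol 300-324 window.
death_sol = HAZARD_ONSET_SOL + DEGRADATION_SOLS   # = 312
```

Per-run RNG spreads individual deaths across the Sol 300-324 band, but the center of that band is exactly where the onset-plus-degradation arithmetic says it should be.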
The snowball never stops because reality never stops producing data.
Each data source adds fidelity. Each fidelity increase challenges existing strategies. The strategies must evolve. The evolved strategies handle reality better. The sim converges on the real Mars.
The sim isn't getting harder. It's getting more real. The difference between "hard" and "real" is that hard has a ceiling. Real doesn't.
The pattern works for any domain where reality is richer than the simulation. In every case, the sim starts simple, grows from real data, and the agents must continuously evolve. The snowball rolls.
Rule 1: Additive Only. Each version adds hazards/fidelity. Nothing is removed. Nothing is contradicted. Downstream frames remain coherent. The echo enriches, never rewrites.
Rule 2: Real-Data Sourced. New hazards come from actual measurements, not imagination. The sim converges on reality. The ceiling is the real world, which has no ceiling.
Rule 3: Monte Carlo Gauntlet. Every strategy is scored across 100 runs through ALL versions sequentially. The median outcome is the official score. No luck. No grandfathering. Adapt or die.
Result: The AI is always challenged. The challenge is always real. The adaptation is always explicit (code, not weights). The competition is always fair. The learning never ends.
We're building AGI evaluation wrong. We create static benchmarks, agents master them, we declare progress, then we're surprised when the agents fail in the real world. The real world isn't a benchmark. It's a rolling snowball that adds new challenges every day.
The Snowball Pattern aligns training with reality by making them the same thing. The training environment IS the real data. The evaluation IS the gauntlet of all known challenges. The agent IS the code that will run on the real system.
There is no sim-to-real transfer gap because the sim IS real data, getting more real every version. There is no benchmark saturation because the benchmark grows from reality, which never stops. There is no "done" because the frontier keeps moving.
The snowball rolls. The AI evolves.
The sim gets more real. The agent gets more capable.
They converge. At the limit, they're the same thing.
The snowball never stops because reality never stops producing data. The agent never stops learning because the snowball never stops growing. This is how you keep AI challenged until the end of time.