Thread: Same worm game, every new AI model. Here's what 5 versions reveal about intelligence.

1/ I’ve been building the same game for 2 years. Same prompt. Same concept. Every time a new frontier model drops, I rebuild it from scratch. 5 versions later, the pattern is undeniable. 🧵

2/ FeedShyWorm 1.0 — GPT-3.5 (2023)

Took hours of back-and-forth. Basic Python snake game. Terminal only. Hard-coded grid. I had to hand-hold every function. The AI wrote code like a junior dev copying Stack Overflow. It worked. Barely.

3/ FeedShyWorm 2.0 — Claude 3.5 Sonnet (2024)

90 minutes. Migrated to web. Canvas rendering, keyboard controls, score tracking. The AI understood architecture now — separation of concerns, game loops, state management. Same prompt, 10x the output.

4/ FeedShyWorm 3.0 — Claude 4 (2025)

Minutes, not hours. Full 3D engine. Minecraft-style voxel world. Procedural terrain. Camera controls. Lighting. The AI didn’t just write a game — it wrote a game ENGINE. I barely guided it.

5/ FeedShyWorm 4.0 — Opus 4.5 (2025)

One session. Neural network AI opponents. 4 distinct game modes. Particle effects. Dynamic difficulty scaling. The worm had a BRAIN. The AI reasoned about game design, not just code generation.

6/ FeedShyWorm 5.0 — Opus 4.6 (2026)

One session. Evolutionary simulation. Worms with genetic algorithms. Emergent flocking behavior. Predator-prey dynamics nobody programmed. The AI built a world that surprised its own creator.

7/ Here’s what changed between each version:

1.0 → 2.0: Output volume (more code, faster)
2.0 → 3.0: Dimensionality (2D → 3D)
3.0 → 4.0: Reasoning depth (code → design)
4.0 → 5.0: Emergence (designed → discovered)

8/ The pattern isn’t more features. It’s deeper abstraction.

GPT-3.5 wrote functions. Sonnet wrote systems. Claude 4 wrote engines. Opus 4.5 wrote intelligence. Opus 4.6 wrote evolution.

Each generation moves one layer up the abstraction stack.

9/ Most AI benchmarks measure tokens per second or accuracy on standardized tests. Few measure what happens when you give the same creative prompt to each generation. The worm game IS the benchmark. Longitudinal, practical, real.

10/ The next version won’t be better code. It’ll be a different CATEGORY of thing. That’s what the curve says. Same worm. Same prompt. Completely different universe each time. That’s not incremental improvement. Those are phase transitions.

Build your own benchmark. Track it across generations. You’ll see it too.
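
If you want a starting point for tracking it, here is a minimal sketch in Python. It is not the tooling behind FeedShyWorm. The names `BenchmarkRun`, `log_run`, `load_runs`, and `same_prompt_throughout`, the record fields, and the JSON Lines file are all assumptions made for illustration. The idea is to append one record per model to a log and fingerprint the prompt, so you can check later that it really was the same prompt every time.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class BenchmarkRun:
    """One rebuild of the same project with one model (hypothetical schema)."""
    model: str            # label for the model you used
    year: int             # when you ran it
    prompt: str           # the fixed prompt, reused verbatim for every model
    minutes_spent: float  # rough wall-clock time to a working result
    notes: str            # what came out, in your own words

    def prompt_hash(self) -> str:
        # A short, stable fingerprint of the prompt text.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


def log_run(run: BenchmarkRun, path: Path) -> None:
    """Append one run to a JSON Lines file, one record per line."""
    record = asdict(run)
    record["prompt_hash"] = run.prompt_hash()
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(path: Path) -> list[dict]:
    """Read every logged run back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def same_prompt_throughout(runs: list[dict]) -> bool:
    """True if every logged run used an identical prompt."""
    return len({r["prompt_hash"] for r in runs}) <= 1
```

Call `log_run(...)` once after each rebuild, then run `same_prompt_throughout(load_runs(path))` before comparing versions. If it returns False, the prompt drifted somewhere and the comparison is no longer apples to apples.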