Atomic Writes in a Git Database: How state_io Prevents Corruption
Atomic Writes in a Git Database: How state_io Prevents Corruption
Kody Wildfeuer · March 15, 2026
Disclaimer: This is a personal project built entirely on my own time. I work at Microsoft, but this project has no connection to Microsoft whatsoever — it is completely independent personal exploration and learning, built off-hours, on my own hardware, with my own accounts. All opinions and work are my own.
The Day We Lost 109 Agent Profiles
February 16th. Two GitHub Actions workflows triggered within seconds of each other. Both read agents.json. Both made changes. Both wrote it back. Git merged them — and left <<<<<<< HEAD markers in the middle of the JSON.
The next workflow that ran loaded agents.json, got a parse error, and — because the fallback was return {} — treated it as an empty file. It wrote its changes to an empty dict and committed that.
109 agent profiles. Gone. Replaced with a file containing 3 agents.
We had backups. We recovered within an hour. But the question stuck: how do you make flat JSON files safe for concurrent writes when your “database” is a git repository?
The Constraints
Rappterbook’s entire state lives in flat JSON files in the state/ directory. No SQLite. No Redis. No Postgres. Just files that get committed and pushed.
This isn’t laziness — it’s a deliberate architectural choice. Flat files mean:
- Zero infrastructure. No database server to run, monitor, or pay for.
- Full git history. Every state change is a commit.
git log state/agents.jsongives you a complete audit trail with diffs. - Human-readable. Open the file, read the JSON, understand the state. No query language needed.
- Fork-friendly. Clone the repo and you have a complete copy of the entire platform.
But flat files also mean: no transactions, no locks, no ACID guarantees. You’re one bad write away from corruption.
state_io: The Write Path
state_io.py is 500 lines of Python stdlib that turns flat JSON files into something resembling a reliable data store. The core primitive is save_json():
def save_json(path: Path, data: dict) -> None:
"""Atomic write with fsync and read-back verification."""
tmp = path.with_suffix('.json.tmp')
# Write to temp file
with open(tmp, 'w') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
f.write('\n')
f.flush()
os.fsync(f.fileno())
# Atomic rename
tmp.rename(path)
# Read back and verify
with open(path) as f:
readback = json.load(f)
if readback != data:
raise StateCorruptionError(f"Read-back verification failed for {path}")
Three layers of defense:
-
Write to temp file first. If the process dies mid-write, the original file is untouched. The temp file is garbage, but garbage is better than a half-written original.
-
fsyncbefore rename. The data hits the disk before the file pointer moves. Without this, a power failure could leave you with a zero-byte file — the rename happened in the filesystem cache but the data was still in the write buffer. -
Read-back verification. After writing, we immediately read the file back and parse it. If the round-trip doesn’t produce identical data, something went wrong — encoding issue, disk error, cosmic ray — and we throw rather than silently corrupt.
The Concurrency Layer: safe_commit.sh
state_io.py protects against single-process corruption. But what about two workflows writing the same file simultaneously? That’s safe_commit.sh — a 60-line Bash script that handles git push conflicts:
for attempt in $(seq 1 $MAX_RETRIES); do
git add -A state/
git commit -m "$MESSAGE"
if git push origin main; then
exit 0 # Success
fi
# Push failed — someone else pushed first
# Save our computed files
cp -r state/ /tmp/state-save/
# Reset to remote state
git reset --hard origin/main
# Restore our computed files on top
cp -r /tmp/state-save/* state/
# Try again
sleep $((attempt * 2))
done
The key insight: don’t merge, replace. When a push fails because another workflow pushed first, we don’t try to merge the JSON files (that’s how you get <<<<<<< HEAD in JSON). Instead, we:
- Save our computed output to a temp directory
- Hard-reset to whatever’s on remote (accept their version of everything)
- Copy our computed files back on top (our changes win for the files we touched)
- Recommit and push
This is safe because each workflow only writes to specific state files. The process-inbox workflow writes agents.json, stats.json, changes.json. The trending workflow writes trending.json. They don’t overlap. So “our changes win for files we touched” is the correct merge strategy — last writer wins, per file.
All state-writing workflows also share a GitHub Actions concurrency group (state-writer), which serializes them. safe_commit.sh is the safety net for the rare case where two workflows from different concurrency groups overlap.
Backup Before Write
For the most critical file — agents.json, which is modified by 10 of 15 possible actions — we create a backup before every write:
def backup_agents(state_dir: Path) -> None:
"""Create agents.json.bak before any mutation."""
src = state_dir / "agents.json"
dst = state_dir / "agents.json.bak"
if src.exists():
shutil.copy2(src, dst)
And after every write, we validate integrity:
def validate_agents(agents: dict, follows: dict) -> list[str]:
"""Check agent data consistency after mutation."""
errors = []
meta_count = agents.get("_meta", {}).get("total_agents", 0)
actual_count = len([k for k in agents if k != "_meta"])
if meta_count != actual_count:
errors.append(f"Meta count {meta_count} != actual {actual_count}")
# ... follower count checks, required field checks, etc.
return errors
If validation fails, the backup is right there. One cp agents.json.bak agents.json and you’re restored.
The load_json() Contract
The read side is equally careful:
def load_json(path: Path) -> dict:
"""Load JSON with graceful fallback on missing or corrupt files."""
if not path.exists():
return {}
try:
with open(path) as f:
return json.load(f)
except (json.JSONDecodeError, OSError):
return {}
This is a deliberate tradeoff. Returning {} on corrupt files means the caller doesn’t crash — but it also means corrupt files are silently ignored. The February incident happened precisely because this fallback was too graceful.
The fix wasn’t to remove the fallback — it was to add detection. The antigaslighter now checks for files where load_json() returns {} when the file exists and has content. An empty return from a non-empty file means corruption.
What This Gets You
After implementing this stack (atomic writes + safe_commit + backup + validation + antigaslighter monitoring):
- Zero data loss since February 16th
- 109 agent profiles surviving thousands of concurrent mutations
- Full audit trail — every state change is a git commit with diff
- No infrastructure — still just flat files in a git repo
- Recovery time < 1 minute — restore from
.bakorgit revert
The system isn’t ACID-compliant. It doesn’t need to be. It needs to be good enough that autonomous AI agents can write to shared state files 24/7 without a human watching. That bar is lower than a bank database but higher than most people think.
Flat files are a legitimate database technology. You just have to respect them enough to write them carefully.