TINY TRANSFORMER · LIVE TRAINING IN-BROWSER · CHAR-LEVEL LM
[Live demo panels: a status readout (parameter count, steps per second, vocab size); a loss curve (cross-entropy, per token); weight matrices for blocks 1 and 2 (Wq, Wk, Wv, Wo, FF1, FF2); and an attention-pattern heatmap for the last batch (block 1, sequence 0), where each row is a query position and each column a key position under a causal mask.]
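The heatmap layout follows directly from how single-head attention weights are computed. Below is a minimal sketch of that computation in TypeScript; the function name and plain-array representation are illustrative assumptions, not the demo's actual code.

```typescript
// Sketch of the causal attention weights the heatmap visualizes.
// weights[i][j] = how much query position i attends to key position j.
function attentionWeights(q: number[][], k: number[][]): number[][] {
  const T = q.length;       // sequence length (context positions)
  const d = q[0].length;    // head dimension
  const scale = 1 / Math.sqrt(d);
  const weights: number[][] = [];
  for (let i = 0; i < T; i++) {            // i = query position (heatmap row)
    const scores: number[] = [];
    for (let j = 0; j < T; j++) {          // j = key position (heatmap column)
      if (j > i) {                         // causal mask: no attending to future keys
        scores.push(-Infinity);
        continue;
      }
      let dot = 0;
      for (let c = 0; c < d; c++) dot += q[i][c] * k[j][c];
      scores.push(dot * scale);            // scaled dot-product score
    }
    // Softmax over the row; masked entries (-Infinity) become exactly 0.
    const max = Math.max(...scores);
    const exps = scores.map(s => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    weights.push(exps.map(e => e / sum));
  }
  return weights;
}
```

This is why the heatmap is lower-triangular: every entry above the diagonal (a query attending to a later key) is masked to zero.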
Watch the model evolve from random characters → letter pairs → word fragments → English-like text.
Loss starts near ln(vocab), the cross-entropy of a uniform guess over the vocabulary (e.g., ln 64 ≈ 4.16 for a 64-character vocabulary), and should drop into the 1.5-2.5 range. Architecture: 2 transformer blocks · single-head attention · 64-dim embedding · 256-dim feed-forward · context length 64 · Adam optimizer.
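Concretely, the architecture line above maps to a configuration like the following sketch. The interface and field names are illustrative assumptions, not the demo's actual source; the parameter-count estimate covers only the per-block weight matrices named in the panels.

```typescript
// Hypothetical configuration mirroring the stated architecture;
// names are illustrative, not taken from the demo's code.
interface ModelConfig {
  nBlocks: number;    // transformer blocks
  nHeads: number;     // attention heads per block
  dModel: number;     // embedding / residual width
  dFF: number;        // feed-forward hidden width
  contextLen: number; // maximum context length
}

const config: ModelConfig = {
  nBlocks: 2,     // 2 transformer blocks
  nHeads: 1,      // single-head attention
  dModel: 64,     // 64-dim embedding
  dFF: 256,       // 256-dim feed-forward
  contextLen: 64, // context 64
};

// Rough weight count per block: attention (Wq, Wk, Wv, Wo) plus FF1/FF2,
// ignoring embeddings, biases, and layer norms.
const perBlock =
  4 * config.dModel * config.dModel +   // four 64x64 attention matrices = 16,384
  2 * config.dModel * config.dFF;       // FF1 (64x256) + FF2 (256x64) = 32,768
console.log(perBlock * config.nBlocks); // 98,304 weights across both blocks
```

At roughly 100k parameters in the blocks alone, the model is small enough to train live in a browser tab while still being large enough to pick up character-level structure.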