TINY TRANSFORMER · LIVE TRAINING IN-BROWSER · CHAR-LEVEL LM

init… · params: — · steps/s: — · vocab: —

Loss curve (cross-entropy, per token)

step: 0 · loss: —

Block 1 — Wq, Wk, Wv, Wo, FF1, FF2

Block 2 — Wq, Wk, Wv, Wo, FF1, FF2

Attention pattern · last batch (block 1, sequence 0)

row = query position · column = key position (causal mask)
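
The heatmap is lower-triangular because the causal mask lets query position i attend only to key positions j ≤ i. A minimal TypeScript sketch of the usual additive-mask construction (illustrative; the demo's own code may build it differently):

```ts
// Builds the additive causal mask behind the lower-triangular heatmap:
// 0 where query i may attend to key j (j <= i), -Infinity elsewhere.
// The mask is added to raw attention scores before the softmax.
function causalMask(T: number): number[][] {
  const mask: number[][] = [];
  for (let i = 0; i < T; i++) {
    mask.push(Array.from({ length: T }, (_, j) => (j <= i ? 0 : -Infinity)));
  }
  return mask;
}

console.log(causalMask(4));
// [ [0, -Inf, -Inf, -Inf],
//   [0,    0, -Inf, -Inf],
//   [0,    0,    0, -Inf],
//   [0,    0,    0,    0] ]
```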
Watch the model evolve from random characters → letter pairs → word fragments → English-like text. Loss starts near ln(vocab), the cross-entropy of a uniform random guess over the vocabulary, and should drop into the 1.5-2.5 range. Architecture: 2 transformer blocks · single-head attention · 64-dim embedding · 256-dim feed-forward · context length 64 · Adam optimizer.
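
As a quick check of the "starts near ln(vocab)" claim, here is a sketch of the stated hyperparameters plus the initial-loss arithmetic. The config object, its field names, and the 65-character example vocabulary are illustrative assumptions, not the demo's actual code:

```ts
// Hyperparameters as described above; names are illustrative.
const config = {
  nBlocks: 2,     // 2 transformer blocks
  nHeads: 1,      // single-head attention
  dModel: 64,     // 64-dim embedding
  dFF: 256,       // 256-dim feed-forward hidden layer
  contextLen: 64, // 64-token context window
};

// An untrained model emits near-uniform logits, so per-token cross-entropy
// starts at about -ln(1/V) = ln(V): hence "loss starts near ln(vocab)".
function initialLoss(vocabSize: number): number {
  return Math.log(vocabSize);
}

console.log(initialLoss(65).toFixed(2)); // "4.17" for an example 65-char vocab
```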

Controls

learning rate: 3.0e-3
sampling temperature: 1.00
batch size: 16
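
The first control is the Adam learning rate. A hedged sketch of a single Adam update at that rate; beta1, beta2, and epsilon are the standard defaults, assumed here since the demo does not state them:

```ts
// One Adam step over a flat parameter tensor at the 3.0e-3 learning rate
// from Controls. b1/b2/eps are standard defaults (an assumption; the demo
// does not state them). t is the 1-based step count.
function adamStep(
  w: Float32Array,  // parameters, updated in place
  g: Float32Array,  // gradients
  m: Float32Array,  // first-moment state
  v: Float32Array,  // second-moment state
  t: number,
  lr = 3e-3,
  b1 = 0.9,
  b2 = 0.999,
  eps = 1e-8,
): void {
  for (let i = 0; i < w.length; i++) {
    m[i] = b1 * m[i] + (1 - b1) * g[i];
    v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i];
    const mHat = m[i] / (1 - Math.pow(b1, t)); // bias correction
    const vHat = v[i] / (1 - Math.pow(b2, t));
    w[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
  }
}
```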

Stats

step: 0
loss (smoothed): —
tokens seen: 0
elapsed: 0s

Corpus (paste to replace)

Recent samples (every 50 steps)
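
For context on how those samples could be produced, a sketch of temperature-scaled sampling from the model's next-character logits; sampleChar and its signature are hypothetical, not the demo's API:

```ts
// Draws one character index from next-char logits via a temperature-scaled
// softmax. At temperature 1.00 (the Controls value, assuming that slider is
// the sampling temperature) the distribution is used as-is.
function sampleChar(logits: number[], temperature = 1.0): number {
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const total = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1; // fallback for floating-point edge cases
}
```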