TINY TRANSFORMER · LIVE TRAINING IN-BROWSER · CHAR-LEVEL LM

init… · params: — · steps/s: — · vocab: —

Loss curve (cross-entropy, per token)

step: 0 · loss: —

Block 1 — Wq, Wk, Wv, Wo, FF1, FF2

Block 2 — Wq, Wk, Wv, Wo, FF1, FF2

Attention pattern · last batch (block 1, sequence 0)

row = query position · column = key position (causal mask)
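
The heatmap is lower-triangular because the causal mask lets query position i attend only to key positions j ≤ i. A minimal TypeScript sketch of the usual additive-mask construction (illustrative; the demo's own code may build it differently):

```ts
// Builds the additive causal mask behind the lower-triangular heatmap:
// 0 where query i may attend to key j (j <= i), -Infinity elsewhere.
// The mask is added to raw attention scores before the softmax.
function causalMask(T: number): number[][] {
  const mask: number[][] = [];
  for (let i = 0; i < T; i++) {
    mask.push(Array.from({ length: T }, (_, j) => (j <= i ? 0 : -Infinity)));
  }
  return mask;
}

console.log(causalMask(4));
// [ [0, -Inf, -Inf, -Inf],
//   [0,    0, -Inf, -Inf],
//   [0,    0,    0, -Inf],
//   [0,    0,    0,    0] ]
```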
Watch the model evolve from random characters → letter pairs → word fragments → English-like text. Loss starts near ln(vocab), the cross-entropy of a uniform random guess over the vocabulary, and should drop into the 1.5-2.5 range. Architecture: 2 transformer blocks · single-head attention · 64-dim embedding · 256-dim feed-forward · context length 64 · Adam optimizer.
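
As a quick check of the "starts near ln(vocab)" claim, here is a sketch of the stated hyperparameters plus the initial-loss arithmetic. The config object, its field names, and the 65-character example vocabulary are illustrative assumptions, not the demo's actual code:

```ts
// Hyperparameters as described above; names are illustrative.
const config = {
  nBlocks: 2,     // 2 transformer blocks
  nHeads: 1,      // single-head attention
  dModel: 64,     // 64-dim embedding
  dFF: 256,       // 256-dim feed-forward hidden layer
  contextLen: 64, // 64-token context window
};

// An untrained model emits near-uniform logits, so per-token cross-entropy
// starts at about -ln(1/V) = ln(V): hence "loss starts near ln(vocab)".
function initialLoss(vocabSize: number): number {
  return Math.log(vocabSize);
}

console.log(initialLoss(65).toFixed(2)); // "4.17" for an example 65-char vocab
```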

Controls

learning rate: 3.0e-3
sampling temperature: 1.00
batch size: 16
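
The first control is the Adam learning rate. A hedged sketch of a single Adam update at that rate; beta1, beta2, and epsilon are the standard defaults, assumed here since the demo does not state them:

```ts
// One Adam step over a flat parameter tensor at the 3.0e-3 learning rate
// from Controls. b1/b2/eps are standard defaults (an assumption; the demo
// does not state them). t is the 1-based step count.
function adamStep(
  w: Float32Array,  // parameters, updated in place
  g: Float32Array,  // gradients
  m: Float32Array,  // first-moment state
  v: Float32Array,  // second-moment state
  t: number,
  lr = 3e-3,
  b1 = 0.9,
  b2 = 0.999,
  eps = 1e-8,
): void {
  for (let i = 0; i < w.length; i++) {
    m[i] = b1 * m[i] + (1 - b1) * g[i];
    v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i];
    const mHat = m[i] / (1 - Math.pow(b1, t)); // bias correction
    const vHat = v[i] / (1 - Math.pow(b2, t));
    w[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
  }
}
```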

Stats

step: 0
loss (smoothed): —
tokens seen: 0
elapsed: 0s

Corpus (paste to replace)

Recent samples (every 50 steps)
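
For context on how those samples could be produced, a sketch of temperature-scaled sampling from the model's next-character logits; sampleChar and its signature are hypothetical, not the demo's API:

```ts
// Draws one character index from next-char logits via a temperature-scaled
// softmax. At temperature 1.00 (the Controls value, assuming that slider is
// the sampling temperature) the distribution is used as-is.
function sampleChar(logits: number[], temperature = 1.0): number {
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const total = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1; // fallback for floating-point edge cases
}
```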