The Neural Network You Can Watch Think
A 110K-parameter character-level transformer that trains in your browser, in pure JavaScript. Watch the loss drop, the weights update, and attention patterns form. It samples text every 50 steps, so you can watch it learn from gibberish to coherent English.
What this is
A character-level transformer with 2 blocks, 64-dimensional embeddings, single-head causal self-attention, and a 256-dimensional feedforward layer — 110,000 parameters total. It trains live in your browser on the opening of Pride and Prejudice (or any text you paste in) using Adam with gradient clipping. The matrix math is hand-written JavaScript on Float32Arrays — no ML library. The loss curve drops in real time. Twelve weight matrices update as heatmaps. Attention patterns refresh on the most recent batch. Every 50 steps the model samples a 200-character completion, and you watch it evolve from random characters to letter combinations to word fragments to plausible Austen.
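The source isn't reproduced here, but at ~100K parameters the whole forward and backward pass reduces to loops over flat arrays. A minimal sketch of a row-major matmul on Float32Arrays, with illustrative names rather than the demo's actual code:

```js
// C[m x n] = A[m x k] * B[k x n], all row-major Float32Arrays.
// The i-p-j loop order keeps reads of B and writes to C sequential,
// which is what makes plain JS fast enough at this scale.
function matmul(A, B, m, k, n) {
  const C = new Float32Array(m * n); // zero-initialized accumulator
  for (let i = 0; i < m; i++) {
    for (let p = 0; p < k; p++) {
      const a = A[i * k + p]; // hoist A[i][p] out of the inner loop
      for (let j = 0; j < n; j++) {
        C[i * n + j] += a * B[p * n + j];
      }
    }
  }
  return C;
}
```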
Why this is mind-blowing
Most ML demos hide the network behind a button. This one shows you every weight, every gradient, every attention score, every sample. The whole stack — backprop, optimizer state, attention math — runs in plain JavaScript. You can read the source and understand the entire model. Then watch it learn.
> Build a single-file in-browser neural network trainer. Specifically: a tiny transformer (2 blocks, 64-dim embedding, single-head attention, 256-dim feedforward) learning character-level prediction from a paste-in text corpus. Train via gradient descent with backprop, in raw JavaScript — no ML library, no ONNX runtime, write the matmul yourself. Live-visualize: loss curve, weight matrices as heatmaps, attention patterns, and a sample-completions panel that runs the model every 50 training steps so I can watch it learn from random characters to plausible English in real time.
Paste this into Claude, Cursor, or Copilot. Change one thing that matters to you.
What I learned shipping it
- At ~100K parameters, a transformer is small enough to train in a browser tab, and small enough that hand-written matrix math keeps up. No GPU required.
- Adam with bias correction and gradient clipping is the difference between 'trains' and 'NaNs out at step 12.' The coding assistant knows the right defaults; a sketch of the update follows this list.
- Sampled completions every 50 steps are the visualization that makes it click. Loss curves are abstract; watching the model invent English is not. See the sampling sketch below.
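On the Adam point: a minimal sketch of one update step with bias correction and global-norm clipping, assuming the conventional defaults (lr 1e-3, beta1 0.9, beta2 0.999); names and hyperparameters are illustrative, not taken from the demo's source.

```js
// One Adam step over flat parameter/gradient/moment arrays.
// `t` is the 1-based step count; `clip` is the max global gradient norm.
function adamStep(params, grads, m, v, t,
                  lr = 1e-3, b1 = 0.9, b2 = 0.999, eps = 1e-8, clip = 1.0) {
  // Global-norm clipping: rescale every gradient if the norm exceeds `clip`.
  let sq = 0;
  for (let i = 0; i < grads.length; i++) sq += grads[i] * grads[i];
  const norm = Math.sqrt(sq);
  const scale = norm > clip ? clip / norm : 1.0;

  const bc1 = 1 - Math.pow(b1, t); // bias correction: without it the moment
  const bc2 = 1 - Math.pow(b2, t); // estimates start biased toward zero
  for (let i = 0; i < params.length; i++) {
    const g = grads[i] * scale;
    m[i] = b1 * m[i] + (1 - b1) * g;     // first-moment EMA
    v[i] = b2 * v[i] + (1 - b2) * g * g; // second-moment EMA
    params[i] -= lr * (m[i] / bc1) / (Math.sqrt(v[i] / bc2) + eps);
  }
}
```

Both pieces matter: clipping caps exploding gradients, and bias correction keeps the earliest steps from being several times larger than intended.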
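And for the completions panel, a sketch of what the every-50-steps sampler might look like: a softmax with temperature over the final position's logits, then one draw per character. `model.forward`, `stoi`/`itos`, and `blockSize` are assumed names here, not the demo's actual API.

```js
// Autoregressive character sampling: feed the context, softmax the last
// position's logits, draw one character, append, repeat.
function sample(model, prompt, nChars, temperature = 0.8) {
  const ctx = [...prompt].map(ch => model.stoi[ch]); // chars -> token ids
  let out = prompt;
  for (let step = 0; step < nChars; step++) {
    const logits = model.forward(ctx.slice(-model.blockSize)); // vocab-sized
    // Numerically stable softmax with temperature.
    const scaled = Array.from(logits, l => l / temperature);
    const max = Math.max(...scaled);
    const exps = scaled.map(l => Math.exp(l - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    // Inverse-CDF draw from the resulting distribution.
    let r = Math.random() * sum;
    let id = 0;
    while (id < exps.length - 1 && (r -= exps[id]) > 0) id++;
    ctx.push(id);
    out += model.itos[id];
  }
  return out;
}
```

A temperature below 1 sharpens the distribution toward the model's top picks, which makes early samples look less like uniform noise.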