Explainers · Transformers
A Visual Primer
How a transformer
actually thinks.
Eight stages. The whole loop, one screen at a time.
01
Text becomes tokens.
“The barrister cross-examined unhappily.”
→
The | barrister | cross | - | examined | un | happily | .
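In code, the split works roughly like this greedy longest-match sketch. The vocabulary below is invented for illustration; real tokenizers learn on the order of 100k subword entries with byte-pair encoding.

```python
# Toy greedy tokenizer: longest vocabulary match wins at each position.
# VOCAB is a made-up illustration, not a real model's vocabulary.
VOCAB = {"The", " barrister", " cross", "-", "examined", " un", "happily", "."}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # pick the longest vocabulary entry that matches at position i,
        # falling back to the single character if nothing matches
        match = max((v for v in VOCAB if text.startswith(v, i)),
                    key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("The barrister cross-examined unhappily."))
# → ['The', ' barrister', ' cross', '-', 'examined', ' un', 'happily', '.']
```

Note how the leading space is part of the token: that is how subword tokenizers typically mark word boundaries.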
02
Each token is a point in space.
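Concretely, each token id indexes one row of a learned embedding matrix. The sizes here are toy values chosen for the sketch; large models use a vocabulary of ~100k entries and dimensions in the thousands.

```python
import numpy as np

# Toy embedding lookup: each token id selects a row, i.e. a point in
# d_model-dimensional space. The matrix is random here; in a real model
# it is learned during training.
vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = [0, 3, 5]            # hypothetical ids for three tokens
points = embedding[token_ids]    # one d_model-dim point per token
print(points.shape)              # (3, 4)
```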
03
Every token looks at every previous token.
[Attention heatmap: rows and columns both run over the tokens “The barrister opened her brief because she needed notes”.]
Brighter cells = stronger attention. “her” looks back at “barrister”. “she” too. The model learns these links from data alone.
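A minimal sketch of that heatmap's mechanics: scaled dot-product attention with a causal mask, so each token mixes information only from itself and earlier positions. Sizes and weights are toy values.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Scaled dot-product attention where each position may only look
    at itself and earlier positions (lower-triangular mask)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # every token vs. every token
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)       # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ v, weights                    # weighted mix of values

rng = np.random.default_rng(0)
n_tokens, d = 9, 4                                 # 9 tokens, toy width
x = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_attention(x, Wq, Wk, Wv)
```

The `weights` matrix is exactly what the heatmap shows: row i gives how strongly token i attends to each earlier token.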
04
It does this many times in parallel.
Head 1
previous token
Head 2
subject of clause
Head 3
coreference
Head 4
punctuation
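“In parallel” is mostly a reshape: the model's width is sliced into heads, each head runs the same attention over its own slice, and the slices are concatenated back. A toy sketch of the split and merge:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (tokens, d_model) into (heads, tokens, d_head) so each
    head can attend independently over its own slice of the channels."""
    n_tokens, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 8))          # 9 tokens, d_model = 8 (toy sizes)
heads = split_heads(x, n_heads=4)    # 4 heads, each of width 2
print(heads.shape)                   # (4, 9, 2)

# after each head runs attention, the slices are stitched back together
merged = heads.transpose(1, 0, 2).reshape(9, 8)
```

Because each head sees different channels and has its own learned weights, different heads end up tracking different relationships, like the four examples above.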
05
Then each token is processed independently.
in → expand 4× → out
Where most of the model’s knowledge is stored.
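The expand-and-project step is a small per-token MLP. A toy sketch, using ReLU where real models often use GELU:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand to 4x the width, apply a nonlinearity,
    project back. Runs on every token independently."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU; many models use GELU
    return hidden @ W2 + b2

d_model = 8                               # toy width
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, 4 * d_model)), np.zeros(4 * d_model)
W2, b2 = rng.normal(size=(4 * d_model, d_model)), np.zeros(d_model)

x = rng.normal(size=(9, d_model))         # 9 tokens
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)                          # (9, 8): same shape in and out
```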
06
All of that, stacked 80 times.
[Diagram: 80 stacked blocks, L1 through L80, each an attention sublayer followed by a feed-forward sublayer.]
Each block reads from and writes to a shared residual stream.
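The whole stack reduces to a short loop: each sublayer computes an update and adds it back into the stream. The sublayers below are stand-ins, and real blocks also apply layer normalization, but the read-update-add shape is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 80, 9, 8    # toy sizes except the layer count

def attention(x):      # placeholder for a real attention sublayer
    return 0.01 * rng.normal(size=x.shape)

def feed_forward(x):   # placeholder for a real MLP sublayer
    return 0.01 * rng.normal(size=x.shape)

stream = rng.normal(size=(n_tokens, d_model))   # the residual stream
for _ in range(n_layers):
    stream = stream + attention(stream)         # each block writes back
    stream = stream + feed_forward(stream)      # by addition
```

Because every block writes by addition, early-layer information is never overwritten, only amended: that is what makes the stream “shared”.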
07
Output is a probability over every word.
“The barrister opened her brief and began to read the …”
↓
papers       21%
first        14%
witness      11%
opening       9%
judgment      7%
transcript    5%
~100k more   33%
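That distribution comes from a softmax over one score (logit) per vocabulary entry. The scores below are invented for illustration; a real vocabulary has ~100k entries.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    z = np.exp(logits - logits.max())   # subtract max for stability
    return z / z.sum()

# one toy logit per candidate word
logits = np.array([2.0, 1.6, 1.3, 1.1, 0.9, 0.6])
probs = softmax(logits)
print(probs.round(2))   # sums to 1; highest logit gets highest probability
```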
08
Pick one. Append. Repeat.
The barrister opened her brief and
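The generation loop in miniature. Here `next_token_probs` is a stand-in for the full forward pass above, returning a made-up distribution; everything else is the real loop: sample, append, go again.

```python
import random

def next_token_probs(tokens):
    # stand-in for a forward pass; fixed toy distribution for illustration
    return {"papers": 0.6, "first": 0.25, "witness": 0.15}

def generate(prompt_tokens, n_steps, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = next_token_probs(tokens)
        choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(choice)          # append, then loop again
    return tokens

out = generate(["The", "barrister", "opened", "her", "brief", "and"], 3)
```

Nothing carries over between iterations except the growing token list itself, which is the “no memory between steps” point below.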
That is the whole machine.
No memory between steps. No reasoning beyond the loop. Just attention, projection, and a probability over the next token — at remarkable scale.