Explainers · Transformers
A Visual Primer

How a transformer actually thinks.

Eight stages. The whole loop, one screen at a time.

01

Text becomes tokens.

“The barrister cross-examined unhappily.”
[Animation: the sentence splits into subword tokens.]
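A sketch of this step in code, using the open-source tiktoken library. The cl100k_base vocabulary is one tokenizer among many, so the splits it prints are illustrative rather than the exact boundaries in the animation above.

```python
# Tokenization sketch with tiktoken; splits depend on the chosen vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The barrister cross-examined unhappily."
ids = enc.encode(text)

# Each token id, alongside the text span it covers.
for i in ids:
    print(i, repr(enc.decode([i])))
```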
02

Each token is a point in space.

[Scatter plot, axes dim 1 × dim 2: “judge”, “barrister”, “solicitor”, “advocate” cluster as legal roles; “court”, “tribunal”, “chambers” as venues; “dog”, “cat”, “horse” as animals.]
~4,096 dimensions · similar meanings sit close together
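A toy version in NumPy. The words, vectors, and four dimensions here are made up for illustration; a real embedding table has one learned row of ~4,096 numbers per vocabulary entry.

```python
import numpy as np

# Made-up 4-dimensional embeddings (real models use ~4,096 dimensions).
emb = {
    "barrister": np.array([0.9, 0.8, 0.1, 0.0]),
    "solicitor": np.array([0.8, 0.9, 0.2, 0.1]),
    "dog":       np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    """Similarity of direction: closer to 1.0 means closer in meaning."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["barrister"], emb["solicitor"]))  # high: similar meanings sit close
print(cosine(emb["barrister"], emb["dog"]))        # low: unrelated meanings sit apart
```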
03

Every token looks at every previous token.

[Attention heatmap: the tokens of “The barrister opened her brief because she needed notes” label both the rows and the columns.]
Brighter cells = stronger attention. “her” looks back at “barrister”. “she” too. The model learns these links from data alone.
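A minimal single-head sketch in NumPy, assuming learned projection matrices Wq, Wk, Wv (random stand-ins here). The causal mask is what makes every token look only at previous tokens.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). One attention head over a causal mask."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # every token scored against every token
    mask = np.triu(np.ones_like(scores), k=1)      # strictly-future positions
    scores = np.where(mask == 1, -np.inf, scores)  # forbidden: no looking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax: each row sums to 1
    return w @ V                                   # weighted mix of value vectors

# 9 tokens ("The barrister opened her brief because she needed notes"), toy size.
rng = np.random.default_rng(0)
x = rng.normal(size=(9, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(causal_attention(x, Wq, Wk, Wv).shape)  # (9, 16)
```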
04

It does this many times in parallel.

Head 1 · previous token
Head 2 · subject of clause
Head 3 · coreference
Head 4 · punctuation
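A shape-only sketch of the split: the same vectors, carved into per-head slices. Each slice runs the attention from stage 03 independently, then the results are concatenated back. The sizes are illustrative.

```python
import numpy as np

seq_len, d_model, n_heads = 9, 16, 4
d_head = d_model // n_heads

rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))

# Carve the 16-dim vectors into 4 heads of 4 dims each.
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)  # (4, 9, 4): four independent views of the same 9 tokens

# ...each head runs causal attention on its own slice, in parallel...

# Concatenate the heads back into one vector per token.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (9, 16)
```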
05

Then each token is processed independently.

[Diagram: in → expand 4× → out]

Where most of the model’s knowledge is stored.
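A sketch with random stand-in weights, assuming ReLU for readability (most real models use GELU-family nonlinearities). Note there is no mixing between tokens: each row is transformed on its own.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(2)
W_in  = rng.normal(size=(d_model, 4 * d_model))  # expand 4x
W_out = rng.normal(size=(4 * d_model, d_model))  # project back down

def feed_forward(x):
    """x: (seq_len, d_model). Applied to each token independently."""
    h = np.maximum(0, x @ W_in)   # ReLU here; real models typically use GELU
    return h @ W_out

x = rng.normal(size=(9, d_model))
print(feed_forward(x).shape)  # (9, 16): same shape out as in
```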

06

All of that, stacked 80 times.

L1 · attention · feed-forward
L2 · attention · feed-forward
L3 · attention · feed-forward
· · ·
L78 · attention · feed-forward
L79 · attention · feed-forward
L80 · attention · feed-forward
Each block reads from and writes to a shared residual stream.
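As a structural sketch, with stand-in blocks where the learned layers would go. The point is the additions: each block contributes a delta to the stream rather than replacing it.

```python
import numpy as np

def transformer_stack(x, layers):
    """layers: (attention_block, feed_forward_block) pairs, applied in order."""
    for attention, feed_forward in layers:
        x = x + attention(x)      # block output is *added* to the residual stream,
        x = x + feed_forward(x)   # so every later layer can read all earlier work
    return x

# Demo: 80 stand-in blocks. Real blocks are stages 03-05 with learned weights.
rng = np.random.default_rng(3)
x = rng.normal(size=(9, 16))
block = lambda t: 0.01 * t
print(transformer_stack(x, [(block, block)] * 80).shape)  # (9, 16)
```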
07

Output is a probability over every word.

“The barrister opened her brief and began to read the
papers · 21%
first · 14%
witness · 11%
opening · 9%
judgment · 7%
transcript · 5%
~100k more · 33%
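The same step as a sketch: one score (a logit) per vocabulary entry, then softmax to turn scores into probabilities. The numbers are made up, and a real vocabulary has roughly 100k entries, not six.

```python
import numpy as np

vocab  = ["papers", "first", "witness", "opening", "judgment", "transcript"]
logits = np.array([2.1, 1.7, 1.45, 1.25, 1.0, 0.66])  # illustrative scores

# Softmax: exponentiate and normalize so the probabilities sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word:>10}  {p:.0%}")
```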
08

Pick one. Append. Repeat.

The barrister opened her brief and
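The loop as code. `model`, `encode`, and `decode` below are hypothetical stand-ins for stages 01 through 07; the toy demo at the bottom exists only to show the mechanics running end to end.

```python
import numpy as np

def generate(model, encode, decode, prompt, n_tokens, temperature=1.0):
    ids = encode(prompt)
    rng = np.random.default_rng(0)
    for _ in range(n_tokens):
        logits = model(ids)                          # stages 02-07: score every vocab entry
        p = np.exp((logits - logits.max()) / temperature)
        p /= p.sum()
        ids.append(int(rng.choice(len(p), p=p)))     # pick one, append
    return decode(ids)                               # repeat

# Toy demo: a 6-entry "vocabulary" and a model that prefers unseen words.
vocab = ["the", " barrister", " opened", " her", " brief", " and"]
toy_model = lambda ids: -np.bincount(ids, minlength=len(vocab)).astype(float)
print(generate(toy_model,
               encode=lambda s: [0],
               decode=lambda out: "".join(vocab[i] for i in out),
               prompt="the", n_tokens=5))
```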

That is the whole machine.

No memory between steps. No reasoning beyond the loop. Just attention, projection, and a probability over the next token — at remarkable scale.
