Explainers · Transformers
A Visual Primer

How a transformer actually thinks.

Eight stages. The whole loop, one screen at a time.

01

Text becomes tokens.

“The barrister cross-examined unhappily.”
[Animation: the sentence splits into subword tokens.]
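A sketch of this step in code, using the open-source tiktoken library. The cl100k_base vocabulary is one tokenizer among many, so the splits it prints are illustrative rather than the exact boundaries in the animation above.

```python
# Tokenization sketch with tiktoken; splits depend on the chosen vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The barrister cross-examined unhappily."
ids = enc.encode(text)

# Each token id, alongside the text span it covers.
for i in ids:
    print(i, repr(enc.decode([i])))
```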
02

Each token is a point in space.

[Scatter plot, axes dim 1 × dim 2: “judge”, “barrister”, “solicitor”, “advocate” cluster as legal roles; “court”, “tribunal”, “chambers” as venues; “dog”, “cat”, “horse” as animals.]
~4,096 dimensions · similar meanings sit close together
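A toy version in NumPy. The words, vectors, and four dimensions here are made up for illustration; a real embedding table has one learned row of ~4,096 numbers per vocabulary entry.

```python
import numpy as np

# Made-up 4-dimensional embeddings (real models use ~4,096 dimensions).
emb = {
    "barrister": np.array([0.9, 0.8, 0.1, 0.0]),
    "solicitor": np.array([0.8, 0.9, 0.2, 0.1]),
    "dog":       np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    """Similarity of direction: closer to 1.0 means closer in meaning."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["barrister"], emb["solicitor"]))  # high: similar meanings sit close
print(cosine(emb["barrister"], emb["dog"]))        # low: unrelated meanings sit apart
```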
03

Every token looks at every previous token.

[Attention heatmap: the tokens of “The barrister opened her brief because she needed notes” label both the rows and the columns.]
Brighter cells = stronger attention. “her” looks back at “barrister”. “she” too. The model learns these links from data alone.
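A minimal single-head sketch in NumPy, assuming learned projection matrices Wq, Wk, Wv (random stand-ins here). The causal mask is what makes every token look only at previous tokens.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). One attention head over a causal mask."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # every token scored against every token
    mask = np.triu(np.ones_like(scores), k=1)      # strictly-future positions
    scores = np.where(mask == 1, -np.inf, scores)  # forbidden: no looking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax: each row sums to 1
    return w @ V                                   # weighted mix of value vectors

# 9 tokens ("The barrister opened her brief because she needed notes"), toy size.
rng = np.random.default_rng(0)
x = rng.normal(size=(9, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(causal_attention(x, Wq, Wk, Wv).shape)  # (9, 16)
```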
04

It does this many times in parallel.

Head 1 · previous token
Head 2 · subject of clause
Head 3 · coreference
Head 4 · punctuation
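A shape-only sketch of the split: the same vectors, carved into per-head slices. Each slice runs the attention from stage 03 independently, then the results are concatenated back. The sizes are illustrative.

```python
import numpy as np

seq_len, d_model, n_heads = 9, 16, 4
d_head = d_model // n_heads

rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))

# Carve the 16-dim vectors into 4 heads of 4 dims each.
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)  # (4, 9, 4): four independent views of the same 9 tokens

# ...each head runs causal attention on its own slice, in parallel...

# Concatenate the heads back into one vector per token.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (9, 16)
```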
05

Then each token is processed independently.

[Diagram: in → expand 4× → out]

Where most of the model’s knowledge is stored.
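A sketch with random stand-in weights, assuming ReLU for readability (most real models use GELU-family nonlinearities). Note there is no mixing between tokens: each row is transformed on its own.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(2)
W_in  = rng.normal(size=(d_model, 4 * d_model))  # expand 4x
W_out = rng.normal(size=(4 * d_model, d_model))  # project back down

def feed_forward(x):
    """x: (seq_len, d_model). Applied to each token independently."""
    h = np.maximum(0, x @ W_in)   # ReLU here; real models typically use GELU
    return h @ W_out

x = rng.normal(size=(9, d_model))
print(feed_forward(x).shape)  # (9, 16): same shape out as in
```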

06

All of that, stacked 80 times.

L1 · attention · feed-forward
L2 · attention · feed-forward
L3 · attention · feed-forward
· · ·
L78 · attention · feed-forward
L79 · attention · feed-forward
L80 · attention · feed-forward
Each block reads from and writes to a shared residual stream.
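As a structural sketch, with stand-in blocks where the learned layers would go. The point is the additions: each block contributes a delta to the stream rather than replacing it.

```python
import numpy as np

def transformer_stack(x, layers):
    """layers: (attention_block, feed_forward_block) pairs, applied in order."""
    for attention, feed_forward in layers:
        x = x + attention(x)      # block output is *added* to the residual stream,
        x = x + feed_forward(x)   # so every later layer can read all earlier work
    return x

# Demo: 80 stand-in blocks. Real blocks are stages 03-05 with learned weights.
rng = np.random.default_rng(3)
x = rng.normal(size=(9, 16))
block = lambda t: 0.01 * t
print(transformer_stack(x, [(block, block)] * 80).shape)  # (9, 16)
```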
07

Output is a probability over every word.

“The barrister opened her brief and began to read the
papers · 21%
first · 14%
witness · 11%
opening · 9%
judgment · 7%
transcript · 5%
~100k more · 33%
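The same step as a sketch: one score (a logit) per vocabulary entry, then softmax to turn scores into probabilities. The numbers are made up, and a real vocabulary has roughly 100k entries, not six.

```python
import numpy as np

vocab  = ["papers", "first", "witness", "opening", "judgment", "transcript"]
logits = np.array([2.1, 1.7, 1.45, 1.25, 1.0, 0.66])  # illustrative scores

# Softmax: exponentiate and normalize so the probabilities sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word:>10}  {p:.0%}")
```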
08

Pick one. Append. Repeat.

The barrister opened her brief and
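The loop as code. `model`, `encode`, and `decode` below are hypothetical stand-ins for stages 01 through 07; the toy demo at the bottom exists only to show the mechanics running end to end.

```python
import numpy as np

def generate(model, encode, decode, prompt, n_tokens, temperature=1.0):
    ids = encode(prompt)
    rng = np.random.default_rng(0)
    for _ in range(n_tokens):
        logits = model(ids)                          # stages 02-07: score every vocab entry
        p = np.exp((logits - logits.max()) / temperature)
        p /= p.sum()
        ids.append(int(rng.choice(len(p), p=p)))     # pick one, append
    return decode(ids)                               # repeat

# Toy demo: a 6-entry "vocabulary" and a model that prefers unseen words.
vocab = ["the", " barrister", " opened", " her", " brief", " and"]
toy_model = lambda ids: -np.bincount(ids, minlength=len(vocab)).astype(float)
print(generate(toy_model,
               encode=lambda s: [0],
               decode=lambda out: "".join(vocab[i] for i in out),
               prompt="the", n_tokens=5))
```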

That is the whole machine.

No memory between steps. No reasoning beyond the loop. Just attention, projection, and a probability over the next token — at remarkable scale.
