How LLMs Actually Work
Strip away the mystique and a language model is a next-token predictor wrapped around a clever way of moving information. Here's a mental model that is just detailed enough to reason about why it behaves the way it does.
It predicts the next token. That's the whole job.
Everything an LLM does comes from one trained skill: given some text, predict what comes next. Ask a question and the most likely continuation of “Q: … A:” is the answer. Write half a function and the likely continuation is the rest. Reasoning, translation, and code all emerge from scaling up that single objective. Keep this in mind - most surprising behavior makes sense once you remember the model is always answering “what plausibly comes next?”
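The predict-append-repeat loop described above can be sketched in a few lines. This is a toy under loud assumptions: the probability table is made up, and the hypothetical `predict_next` only looks at the last token, whereas a real model conditions on all of the preceding text.

```python
# Toy "model": for each token, a made-up probability table over the next token.
# A real LLM computes these probabilities with a neural network.
NEXT_TOKEN_PROBS = {
    "Q:": {"what": 1.0},
    "what": {"is": 1.0},
    "is": {"2+2?": 1.0},
    "2+2?": {"A:": 1.0},
    "A:": {"4": 0.9, "5": 0.1},
    "4": {"<end>": 1.0},
    "5": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the most likely next token (greedy choice).

    Simplification: only the last token is used as context.
    """
    probs = NEXT_TOKEN_PROBS[tokens[-1]]
    return max(probs, key=probs.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Predict one token, append it, and repeat until <end> or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["Q:", "what", "is", "2+2?"]))
```

Nothing in the loop knows what a question is. The "answer" appears only because it is the most likely continuation in the table.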
Tokens: the units it reads
The model doesn't see words or letters. Text is first chopped into tokens - common chunks of text, often whole words or word-pieces. “Interpretability” might be several tokens; “the” is one. The model predicts one token at a time, appends it, and repeats. This is why token counts - not word counts - drive cost, speed, and context limits, and why models sometimes miscount letters: they see token IDs, never the individual letters.
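To make the chopping concrete, here is a minimal sketch of a tokenizer that greedily matches the longest known piece. The vocabulary is hypothetical and the matching rule is a simplification; production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding) and may split the same word differently.

```python
# Hypothetical vocabulary of known chunks, chosen for illustration only.
VOCAB = {"inter", "pret", "ability", "the", " "}


def tokenize(text, vocab=VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


word = "interpretability"
pieces = tokenize(word)
print(pieces, "->", len(pieces), "tokens for", len(word), "letters")
print(tokenize("the"))
```

The model downstream receives only the pieces, which is why a question about individual letters asks for information it was never directly shown.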
Embeddings: turning tokens into meaning
Each token becomes an embedding - a long vector of numbers positioned so that related meanings sit close together. This is the model's first move: convert symbols into a geometric space where “king” and “queen” are neighbors and direction carries meaning. The same idea powers semantic search and the retrieval in RAG systems.
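"Close together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Made-up embeddings for illustration; real ones are learned during training.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose embedding is most similar to `word`."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]),
    )


print(nearest("king"))
```

Semantic search works the same way at scale: embed the query, then return the stored items whose vectors are nearest to it.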
Attention: how tokens share information
The core mechanism of the transformer is attention. At each layer, every token gets to look back at earlier tokens and pull in the information it needs - a pronoun reaching back to find what it refers to, a closing bracket finding its opener. Attention is “routing”: it decides which past information is relevant right now. Many of the interpretable circuits researchers have found, like induction heads, are built out of a few specific attention heads doing specific jobs.
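The "look back and pull in what you need" step can be sketched as single-head scaled dot-product attention with a causal mask. One assumption to flag: the query, key, and value vectors are supplied by hand here, while a real transformer computes them from the token representations with learned projection matrices and runs many heads in parallel.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention(queries, keys, values):
    """For each position t, mix the values of positions 0..t by relevance."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for t, query in enumerate(queries):
        # Causal mask: position t only scores itself and earlier positions.
        scores = [dot(query, keys[s]) / math.sqrt(d) for s in range(t + 1)]
        weights = softmax(scores)
        mixed = [
            sum(w * values[s][i] for s, w in enumerate(weights))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hand-picked vectors: the last query points strongly at the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]

outputs, weights = causal_attention(queries, keys, values)
print([round(w, 3) for w in weights[2]])
```

The last position puts most of its weight on position 0, so its output is dominated by that position's value. That is the routing: which earlier information gets copied forward is decided by how well queries match keys.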
The residual stream: the shared workspace
Information flows through the model along a central residual stream - think of it as a running scratchpad every layer can read from and write to. Attention moves information along it; the feed-forward layers add computed features to it. After the final layer, the stream's state at the last position is read off and converted into a probability for every token in the vocabulary. The model samples one, appends it, and the loop begins again.
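Here is a rough sketch of that read-and-write pattern for a single position. Everything in it is a stand-in: the two update functions, the three-word vocabulary, and the `UNEMBED` weights are invented for illustration, and details such as normalization layers are left out. The point is only the shape of the computation: each layer adds to the stream rather than replacing it, and the final state is turned into probabilities.

```python
import math
import random

VOCAB = ["cat", "dog", "mat"]
# Hypothetical unembedding weights: one row per vocabulary token.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_update(stream):
    # Stand-in for information pulled in from earlier positions.
    return [0.5, 0.0]


def feed_forward_update(stream):
    # Stand-in for features computed from the current stream state.
    return [0.1 * x for x in stream]


def run_layers(stream, n_layers=2):
    """Each sublayer reads the stream and writes its result back by addition."""
    for _ in range(n_layers):
        stream = add(stream, attention_update(stream))
        stream = add(stream, feed_forward_update(stream))
    return stream


def next_token_distribution(stream):
    """Read the final stream state off as a probability per vocabulary token."""
    logits = [dot(row, stream) for row in UNEMBED]
    return softmax(logits)


probs = next_token_distribution(run_layers([0.0, 1.0]))
token = random.choices(VOCAB, weights=probs, k=1)[0]
print([round(p, 3) for p in probs], "sampled:", token)
```

Because the last step samples from a distribution, running it twice can produce different tokens from the same stream state.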
Where the knowledge comes from
In pre-training, the model sees a vast amount of text and is tuned, one tiny step at a time, to make its next-token predictions more accurate. No facts are typed in; the weights gradually encode patterns, associations, and skills. Afterward, fine-tuning and techniques like reinforcement learning from human feedback shape it from a raw predictor into a helpful, harder-to-misuse assistant - teaching it to follow instructions and refuse bad requests.
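"One tiny step at a time" means gradient descent on a prediction loss. The sketch below shrinks that down to the smallest possible case, with assumptions stated up front: the only parameters are three logits for one fixed context, the correct next token is always "mat", and the learning rate is arbitrary. A real model has billions of weights and learns from a huge variety of contexts, but each update follows the same logic.

```python
import math

VOCAB = ["cat", "dog", "mat"]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on the cross-entropy loss."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
    grads = [
        p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)
    ]
    new_logits = [l - learning_rate * g for l, g in zip(logits, grads)]
    return new_logits, loss


logits = [0.0, 0.0, 0.0]  # start with no preference among tokens
losses = []
for _ in range(5):
    logits, loss = training_step(logits, target_index=VOCAB.index("mat"))
    losses.append(loss)

print([round(x, 3) for x in losses])
```

The loss falls with each step as probability shifts toward the observed token. Nobody wrote the fact "mat comes next" into the parameters; it accumulated from repeated small corrections.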
The context window: its working memory
A model can only consider so much text at once - its context window, measured in tokens. Everything in the window is “in mind”; anything pushed out is gone. Overflow the window and something has to give: many applications silently drop or truncate the oldest content, and the model then behaves as if it never saw it, which is one of the most common causes of confusing agent behavior. RAG exists largely to feed the right material into this limited window at the right time.
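A minimal sketch of one common overflow policy, keeping only the most recent tokens, shows why the forgetting is silent. The function name `fit_to_window` and the keep-the-newest policy are assumptions for illustration; some systems summarize older content instead, and some APIs reject over-length input outright.

```python
def fit_to_window(tokens, max_tokens):
    """Keep the most recent `max_tokens` tokens and report what was dropped."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if len(tokens) <= max_tokens:
        return tokens, []
    dropped = tokens[:-max_tokens]
    kept = tokens[-max_tokens:]
    return kept, dropped


history = ["system:", "be", "brief", "user:", "summarize", "the", "report"]
kept, dropped = fit_to_window(history, max_tokens=4)
print("kept:", kept)
print("dropped:", dropped)
```

In this run the instruction at the start of the history is what gets cut. The model receives only `kept`, with no signal that anything is missing.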
Why it hallucinates
Because the objective is plausibility, not truth. When the model has no solid pattern to draw on, it still produces the most likely-sounding continuation - which can be confidently wrong. Fluency is not accuracy. This is why grounding the model in real sources (retrieval), demanding citations, and running evals matter: they add the truth constraint the base objective never had.
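A small sketch makes the "it still produces something" point concrete. The two probability lists and the candidate names are invented for illustration. The decoding step returns an answer in both cases; only the entropy of the distribution hints that one of them is close to a guess.

```python
import math

# Hypothetical candidate continuations, e.g. method names for an API call.
CANDIDATES = ["load()", "read()", "open()", "fetch()"]


def entropy(probs):
    """Shannon entropy in nats: higher means the distribution is flatter."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def greedy_pick(candidates, probs):
    """Return the most probable candidate, however weak its lead is."""
    return candidates[probs.index(max(probs))]


well_known = [0.97, 0.01, 0.01, 0.01]  # the model has a solid pattern
unfamiliar = [0.26, 0.25, 0.25, 0.24]  # the model is nearly guessing

for name, probs in [("well_known", well_known), ("unfamiliar", unfamiliar)]:
    print(name, greedy_pick(CANDIDATES, probs), round(entropy(probs), 3))
```

Both cases print the same confident-looking output. The uncertainty lives in the distribution, not in the text the reader sees.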
Why this mental model pays off
Once you see the model as “predict the next token, by routing information through a residual stream,” a lot stops being magic. Long prompts degrade because the window fills. The model forgets mid-task because earlier tokens scrolled off. It invents an API because that was the plausible continuation. And the deeper question - what is it actually representing inside? - is exactly where Mechanistic Interpretability picks up.