What an LLM Actually Is
The most common misconception about large language models is that they are brains. They are not. They are prediction engines — mathematical systems trained to guess the next token given everything that came before.
This is not a metaphor. When an LLM generates text, it performs statistical optimization on patterns it learned during training. It has no subjective experience. It has no desires or intentions. It is not aware of its own outputs. The fluency of generated language creates a powerful illusion of understanding, but this is a structural artifact of next-token prediction, not evidence of consciousness or agency.
The LLM-as-brain paradigm — treating these models as autonomous agents responsible for memory, reasoning, and decision-making — is an architectural misapplication. It assigns to the model responsibilities it was never designed to handle. Understanding what an LLM actually is requires tracing the actual mechanism: how text becomes numbers, how numbers acquire meaning through position in high-dimensional space, how attention computes relationships, and how training shapes every behavior the model exhibits.
This article explains those mechanisms step by step. It is the foundation for understanding why orchestration frameworks that respect the prediction-engine nature of LLMs produce more reliable, auditable, and cost-controlled systems than those that treat models as brains. If you are evaluating AI infrastructure for enterprise deployment, what follows is the vocabulary you need to reason precisely about these systems.
Tokenization — How Text Becomes Numbers
Before an LLM can process text, that text must be converted into numbers. The mechanism is called tokenization, and the dominant algorithm in modern LLMs is Byte-Pair Encoding (BPE).
BPE originated as a text compression algorithm developed by Gage in 1994. Researchers adapted it for natural language processing in 2016 when Sennrich and colleagues showed it could handle rare and out-of-vocabulary words more effectively than previous approaches. The algorithm is straightforward: starting with individual characters, it iteratively merges the most frequent adjacent token pair, building a vocabulary optimized for the training data.
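The merge loop described above can be sketched in a few lines. This is a toy, character-level illustration rather than any production tokenizer: the function names (`merge_pair`, `train_bpe`, `segment`), the six-word corpus, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from individual characters; repeated words become one entry with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the alphabetically smallest pair
        # (an arbitrary choice made here so the result is reproducible).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        merges.append(best)
    return merges


def segment(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = train_bpe(toy_words, num_merges=2)
print(merges)                    # [('l', 'o'), ('lo', 'w')]
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

Note that the unseen word "lowly" is still representable: it falls back to smaller pieces, which is the property that made BPE attractive for rare and out-of-vocabulary words.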
BPE is a compression algorithm. It has no semantic understanding. It identifies recurring byte sequences and merges them based on frequency, not meaning. A token might represent a common word like “the”, a prefix like “un-”, a suffix like “-ing”, or a single byte depending on the vocabulary the algorithm constructs.
Vocabulary sizes vary across models. GPT-2 uses 50,257 tokens. GPT-4 uses approximately 100,000. The original LLaMA and LLaMA-2 use 32,000, while LLaMA-3 moved to roughly 128,000. Some research on vocabulary scaling reports diminishing returns for BPE beyond roughly 50,000 tokens, though frontier models continue pushing into larger vocabularies.
When you input text into an LLM, it passes through these stages: the text is first optionally normalized (typically Unicode normalization; most modern LLM tokenizers preserve case and punctuation rather than lowercasing), then segmented into tokens according to the learned vocabulary, then converted to integer token IDs, then passed to the embedding layer. The model never sees raw text. It works exclusively with these numerical codes.
The implications are important. Tokenization operates on surface form, not meaning. Two inputs that mean nearly the same thing can produce entirely different token sequences, and small surface changes such as capitalization, a leading space, or a typo can change how a word is split. A question about cats and a question about the cognitive processes of felines both pass through the same tokenization pipeline, yet they reach the model as different sequences of IDs, and how well the model handles each depends on how closely that sequence matches patterns it saw during training. Tokenization is the first filter between meaning and computation.
Token Embeddings — Meaning as Coordinates
Token IDs alone tell the model nothing about meaning. They are arbitrary labels. The embedding layer converts these discrete integers into dense floating-point vectors — learned representations where semantic relationships emerge from positional geometry.
The embedding layer is a lookup table of shape (V, d_model), where V is vocabulary size and d_model is the dimensionality of the model’s internal representation. Each token ID maps to a d_model-dimensional vector that the model learns to position meaningfully during training.
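As a rough illustration of the lookup, the sketch below maps tokens to IDs and IDs to rows of a (V, d_model) table. The four-entry vocabulary, the `<unk>` fallback, the tiny `d_model`, and the random initial values are all assumptions for the example; in a real model the table is learned and far larger.

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
V, d_model = len(vocab), 4

rng = random.Random(0)
# Embedding table of shape (V, d_model). Random here, learned during training in practice.
embedding_table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(V)]


def embed(tokens):
    """Map tokens to integer IDs, then look up one d_model-dimensional row per ID."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return ids, [embedding_table[i] for i in ids]


ids, vectors = embed(["the", "cat", "sat"])
print(ids)                            # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

The lookup is pure indexing: the same ID always retrieves the same row, and nothing about the integer itself carries meaning.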
The spatial relationships in this high-dimensional space encode semantics. The canonical example, first reported for static word embeddings such as Word2Vec: king - man + woman ≈ queen. The vector for “king” minus the vector for “man” plus the vector for “woman” produces a vector close to “queen” in embedding space. This is not a linguistic rule; it is a structural property of how the model organizes relationships during training. The model learned that “man” and “king” appear in similar contexts, that “woman” and “queen” appear in similar contexts, and that the step from “man” to “woman” points in roughly the same direction as the step from “king” to “queen”. Subtracting “man” and adding “woman” therefore shifts the gender component while preserving the royal status component.
Modern LLM representations are contextualized, not static. Word2Vec, an earlier embedding technique, produced the same vector for a word regardless of context: “bank” as a financial institution and “bank” as a river side received identical embeddings. In a transformer, the embedding lookup itself is still static, so the same token ID always retrieves the same initial vector, but every attention layer above it mixes in information from surrounding tokens. The resulting hidden representation for “bank” in “I deposited money at the bank” therefore differs from the one in “I sat on the bank of the river”.
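The arithmetic in the king and queen example can be made concrete with hand-built vectors. The two-dimensional coordinates below are invented for illustration (one axis standing in for royalty, the other for gender); learned embeddings have hundreds or thousands of dimensions and the relationship holds only approximately.

```python
import math

# Hand-built 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Illustrative values only.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention for this kind of test; without it, the nearest neighbor is often one of the inputs.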
Embeddings are learned, not engineered. No human specifies that “king” should be positioned near “queen” or that gender should be a computable dimension. The model discovers these relationships through gradient descent on next-token prediction, positioning vectors so that predicting the next token given context becomes increasingly accurate. Meaning emerges from the optimization pressure of the training objective.
Attention — The Heart of Transformers
Attention is the mechanism that enables each token to contextually influence every other token in a sequence. It is the heart of the transformer architecture and the source of its expressive power.
The scaled dot-product attention mechanism works with three learned projections per token: Query (Q), Key (K), and Value (V). The query projection represents what information this token is seeking. The key projection represents what information this token offers to others. The value projection represents the actual content this token should contribute to the output.
Attention computes compatibility between queries and keys through dot product: each query is dotted against all keys to produce a score reflecting how much each token should attend to every other token. These scores are divided by the square root of the key dimensionality, because dot products grow in magnitude with dimensionality and large scores would push the softmax into saturated regions where gradients become vanishingly small. A softmax then normalizes the scaled scores into a probability distribution over tokens. Finally, the values are weighted by these attention weights and summed.
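The four steps (dot products, scaling, softmax, weighted sum) translate almost line for line into code. The sketch below uses plain Python lists for a single head and assumes Q, K, and V have already been produced by the learned projections; the function names and the toy inputs are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # 1. Dot the query against every key, 2. divide by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # 3. Normalize the scores into attention weights that sum to 1.
        weights = softmax(scores)
        # 4. Blend the value vectors using those weights.
        blended = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
print(weights[0])  # the first query puts more weight on the first key
print(outputs[0])  # a blend of V[0] and V[1], closer to V[0]
```

Production implementations perform the same computation as batched matrix multiplications on accelerators; the loop form here is only for readability.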
The result: every token’s output becomes a weighted blend of values from the tokens it can attend to, where the weights reflect computed relevance. When predicting the next word after “The cat sat on the”, for example, the final position may attend strongly to “cat” and “sat” because the subject and verb are the most informative preceding tokens for what comes next. In a translation task, a position in the output attends to the source-language tokens that inform the target-language prediction.
Multi-head attention extends this by running multiple attention mechanisms in parallel, each with its own Q/K/V projections. Different heads learn different relationship types. Some heads track syntactic dependencies (subject-verb agreement). Some track coreference (what “it” refers to). Some track semantic relationships. BERT analysis by Clark et al. found individual heads whose attention patterns correspond closely to such functions, which suggests that the model distributes representational load across heads.
The O(n²) barrier. Standard self-attention has quadratic time and space complexity in sequence length: every token attends to every other token. This is not merely an implementation detail. Research on the computational complexity of self-attention establishes a conditional lower bound: unless the Strong Exponential Time Hypothesis (SETH) fails, no algorithm can compute exact attention in truly subquadratic time. KV caching does not remove this cost, but it avoids recomputing keys and values for earlier tokens at every generation step, which is why it is essential for inference optimization. The same quadratic growth is why frameworks that push millions of tokens into context windows face inherent cost scaling challenges.
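A back-of-the-envelope count shows what KV caching saves during generation. The sketch below counts only query-key dot products for one attention head in one layer, under the assumption that an uncached decoder recomputes causal attention over the whole prefix at every step; the function names are made up for the example.

```python
def scores_without_cache(n):
    """Query-key dot products when each step recomputes causal attention over the full prefix."""
    # At step t the prefix holds t tokens, and causal attention over it needs t*(t+1)/2 scores.
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))


def scores_with_cache(n):
    """With cached keys and values, step t scores only the newest query against t keys."""
    return sum(t for t in range(1, n + 1))


for n in (4, 1000):
    print(n, scores_without_cache(n), scores_with_cache(n))
# 4 20 10
# 1000 167167000 500500
```

Even with the cache, the total still grows as n(n+1)/2, so the quadratic term remains; caching removes the redundant recomputation layered on top of it.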
Attention is mechanical computation. It is not comprehension. When tokens attend to each other, they perform matrix multiplications, not understanding meaning. The model produces fluent continuations not because it grasps what it is saying, but because attention has learned statistical regularities about how tokens pattern together in context.
The Transformer Architecture
The transformer architecture chains attention with feed-forward networks, residual connections, and layer normalization into deep stacks that learn hierarchical representations.
Most modern LLMs use a decoder-only architecture. This is the design behind GPT, LLaMA, and Claude. The decoder processes the input sequence left-to-right, masking attention so each token can only attend to preceding tokens, preventing the model from “seeing the answer” when predicting what comes next. Encoder-decoder transformers instead read the input with a bidirectional encoder and generate output with a causally masked decoder that cross-attends to the encoder’s representations, but decoder-only pre-training on raw text without paired input-output data proved simpler to scale and produced stronger zero-shot abilities.
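The causal mask can be shown directly on a matrix of attention scores. In this sketch, row i is the query position and column j the key position; the function name and the all-zero toy scores are assumptions. Real implementations usually add negative infinity to the masked scores before the softmax, which has the same effect as excluding them.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over an n x n score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i (itself and earlier tokens).
        visible = scores[i][: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


for row in causal_softmax([[0.0] * 3 for _ in range(3)]):
    print(row)
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# then three equal weights of one third each
```

The lower-triangular pattern is what lets a single training pass score every next-token prediction in a sequence at once without leaking later tokens.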
Each transformer block contains two primary components. The first is multi-head self-attention. The second is a feed-forward network (FFN): in its classic form, two linear transformations with a non-linear activation function between them. Modern LLMs typically use GELU (Gaussian Error Linear Unit) or SwiGLU (a gated linear unit variant) as the activation. GELU weights each input by its value, scaling it by the Gaussian cumulative distribution function, rather than gating by sign as ReLU does, which produces smoother gradients. SwiGLU adds a multiplicative gate computed by a third linear projection and is used in PaLM, LLaMA-2, and LLaMA-3; published comparisons report modest perplexity improvements over GELU.
Residual connections enable gradient flow through depth. The output of each sublayer is added back to its input, creating a “skip path” that gradients can travel through directly. In the original transformer the sum is then normalized (post-norm); most modern LLMs instead normalize the input to each sublayer before the addition (pre-norm), which tends to train more stably at depth. Either way, a network can learn to effectively skip a layer if it is not useful. Without residual connections, deep networks suffer from vanishing gradients and information degradation. With them, transformers can stack dozens of layers while maintaining training stability.
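The activation functions and the two FFN shapes can be written out compactly. This is a minimal sketch with biases omitted; the weight names (`W1`, `W2`, `W_gate`, `W_up`, `W_down`) are placeholders chosen for the example, and identity matrices stand in for learned weights so the activations are easy to see.

```python
import math


def relu(x):
    # Hard gate: negative inputs are zeroed.
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    # Also called Swish; this is the "Swi" in SwiGLU.
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn_gelu(x, W1, W2):
    """Classic FFN: linear layer, GELU, linear layer."""
    return matvec(W2, [gelu(h) for h in matvec(W1, x)])


def ffn_swiglu(x, W_gate, W_up, W_down):
    """SwiGLU FFN: a SiLU-activated gate multiplies a second projection elementwise."""
    gate = [silu(h) for h in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, -1.0]
print([relu(v) for v in x])                         # [1.0, 0.0]
print(ffn_gelu(x, identity, identity))              # approx [0.8413, -0.1587]
print(ffn_swiglu(x, identity, identity, identity))  # approx [0.7311, 0.2689]
```

Unlike ReLU, GELU lets a small negative signal through for negative inputs, which is the smoothness referred to above.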
Layer normalization stabilizes training by normalizing activations across features rather than across the batch dimension. It computes the mean and standard deviation of all features for each token independently, then scales and shifts with learned parameters. LayerNorm works with variable sequence lengths and does not require synchronization across devices, making it suitable for the variable-length sequences that LLMs process.
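Layer normalization and the residual addition fit in a few lines. The sketch below normalizes one token's feature vector and wraps an arbitrary sublayer in a skip connection. `post_norm_block` follows the add-then-normalize ordering; `pre_norm_block` shows the normalize-first ordering that many recent LLMs use instead. The function names and the `eps` default are assumptions for the example.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


def post_norm_block(x, sublayer, gamma, beta):
    """Add the sublayer output to its input, then normalize the sum."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(summed, gamma, beta)


def pre_norm_block(x, sublayer, gamma, beta):
    """Normalize first, run the sublayer, then add the result to the untouched input."""
    normed = layer_norm(x, gamma, beta)
    return [xi + si for xi, si in zip(x, sublayer(normed))]


x = [1.0, 2.0, 3.0]
gamma, beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(layer_norm(x, gamma, beta))  # approx [-1.2247, 0.0, 1.2247]


def zero_sublayer(v):
    """A sublayer that contributes nothing, to show the skip path."""
    return [0.0] * len(v)


print(pre_norm_block(x, zero_sublayer, gamma, beta))  # [1.0, 2.0, 3.0]
```

When the sublayer outputs zeros, the pre-norm block returns its input unchanged, which is the "skip a layer" behavior in its simplest form.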
The architectural insight is this: depth in transformers is enabled by stability mechanisms, not by cognitive structure. The model builds hierarchical representations — low-level patterns in early layers, abstract relationships in later layers — through the repeated application of the same computational motifs. There is no reasoning, no planning, no internal monologue — only gradient descent optimizing next-token prediction across stacked attention and feed-forward computations.
Training — What the LLM Is Actually Optimized For
Every behavior an LLM exhibits traces to a single training objective: cross-entropy loss minimization on next-token prediction. Understanding this objective is the key to understanding why models behave as they do.
The model is trained on billions of text sequences from books, articles, websites, and code repositories. For each sequence, the model receives all preceding tokens and must predict the next token. The loss function is cross-entropy: for each position in the training sequence, the model computes a probability distribution over the vocabulary, and the loss is the negative log probability of the actual next token. Gradient descent adjusts the model’s weights to minimize this loss across the entire training corpus.
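The loss itself is short enough to compute by hand. The sketch below takes one list of logits per position and the ID of the actual next token at each position, and sums the negative log probabilities. The function name and the four-token toy vocabulary are illustrative assumptions.

```python
import math


def cross_entropy(logits_per_position, target_ids):
    """Sum over positions of -log P(actual next token), computed from raw logits."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # log-sum-exp gives the log of the softmax denominator in a numerically stable way.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax(logits)[target]
        total += log_z - logits[target]
    return total


# A model with no preference over a 4-token vocabulary pays log(4) per position.
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))   # approx 1.3863
# A confident, correct prediction costs almost nothing.
print(cross_entropy([[10.0, 0.0, 0.0, 0.0]], [0]))
# A confident, wrong prediction is penalized heavily.
print(cross_entropy([[10.0, 0.0, 0.0, 0.0]], [1]))
```

Training frameworks typically average this quantity over tokens rather than summing, but the gradient signal is the same up to a constant factor.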
Every capability the model displays — fluent text generation, code completion, question answering, translation — emerges because these behaviors reduced next-token prediction error during training. The model never “learns to reason” as a separate objective. It learns whatever patterns in the training data are most useful for predicting the next token. When those patterns happen to include logical relationships, the model reproduces them. When they include factual associations, the model carries them. When they include stylistic patterns, the model imitates them.
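Temperature is a sampling parameter that modifies the softmax distribution at inference time by dividing the logits by T before the softmax is applied. T > 1 flattens the distribution, increasing the probability of lower-ranked tokens and producing more diverse, creative outputs. T < 1 sharpens the distribution, concentrating probability on high-ranked tokens and producing more focused, predictable outputs. As T approaches 0 the distribution collapses onto the single highest-probability token; because dividing by zero is undefined, implementations treat T = 0 as greedy decoding by convention. Temperature does not change what the model learned; it changes the sampling strategy that determines which learned pattern is expressed.
In principle the model itself is deterministic: given identical weights and identical input, the forward pass is a fixed mathematical function that yields the same logits every time, and randomness enters through the sampling process. In practice, floating-point arithmetic on parallel hardware is not perfectly reproducible, so batch composition, kernel selection, and hardware differences can shift logits slightly and occasionally change which token ranks first. Enterprise systems that require reproducibility set temperature to zero, ensure input tokenization is deterministic, and pin the model version and serving configuration, while still treating bit-exact repeatability as something to verify rather than assume.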
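The effect of temperature is easy to see on a small set of logits. The sketch below divides the logits by T before the softmax and treats T = 0 as greedy decoding by convention. The function names and the three example logits are assumptions for illustration.

```python
import math
import random


def token_probabilities(logits, temperature):
    """Softmax over logits / T; T == 0 is handled as greedy decoding (one-hot on the argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token ID from the temperature-adjusted distribution."""
    probs = token_probabilities(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
for T in (2.0, 1.0, 0.5, 0):
    print(T, [round(p, 3) for p in token_probabilities(logits, T)])

rng = random.Random(0)  # a fixed seed makes the sampled sequence repeatable
print([sample_token(logits, 1.0, rng) for _ in range(10)])
```

The ranking of tokens never changes with temperature; only how much probability mass the top-ranked token holds relative to the rest.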
In-Context Learning — The Illusion of Reasoning
One of the most striking capabilities of large language models is in-context learning (ICL): the ability to perform new tasks from examples included in the prompt, without any weight updates. Show the model three examples of translation, and it translates new sentences. Show it few-shot examples of a task, and it performs that task.
This ability emerges at scale. It is absent in smaller models and appears discontinuously as model size increases — a phenomenon documented by researchers as “emergent abilities.” However, recent work suggests that emergence may partly be an artifact of nonlinear evaluation metrics: with proper metrics, improvements appear more continuous.
The mechanism behind in-context learning is not yet fully understood, but one line of research offers a striking interpretation: under simplifying assumptions, transformer attention can be read as performing an implicit form of gradient descent. In this view, when a model processes in-context examples, attention computes what these papers call meta-gradients, quantities that play the same role as the weight updates the model would receive if it were actually trained on those examples. No parameters are modified. The effect lives entirely in the activations of the forward pass, where it acts like a temporary update that lets the model match the pattern demonstrated in the prompt.
The research from Dai et al. and von Oswald et al. formalizes this finding: for linear attention (attention with the softmax removed), the computation has a dual form that is mathematically equivalent to a gradient descent update applied to a linear layer, with the in-context examples serving as the training signal. For that simplified setting the correspondence is exact rather than metaphorical. How closely it describes full softmax attention in trained production models remains an open research question.
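The core algebraic identity behind the dual form can be checked numerically. The sketch below covers only the simplified, softmax-free (linear) attention case: attending over in-context keys and values gives the same output as first accumulating a weight matrix from outer products of values and keys and then applying it to the query. The function names and the small integer inputs are assumptions for the example, and nothing here modifies any stored model parameter.

```python
def linear_attention(keys, values, query):
    """Softmax-free attention: sum over i of values[i] * (keys[i] . query)."""
    d_v = len(values[0])
    out = [0] * d_v
    for k, v in zip(keys, values):
        score = sum(ki * qi for ki, qi in zip(k, query))
        for j in range(d_v):
            out[j] += v[j] * score
    return out


def implicit_weight_update(keys, values):
    """Sum of outer products values[i] keys[i]^T, the same shape as a gradient step on a linear layer."""
    d_v, d_k = len(values[0]), len(keys[0])
    delta_w = [[0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                delta_w[r][c] += v[r] * k[c]
    return delta_w


def apply_matrix(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


keys = [[1, 0], [0, 1], [1, 1]]      # stand-ins for in-context example inputs
values = [[1, 2], [3, 4], [5, 6]]    # stand-ins for the corresponding update signals
query = [2, -1]

delta_w = implicit_weight_update(keys, values)
print(linear_attention(keys, values, query))  # [4, 6]
print(apply_matrix(delta_w, query))           # [4, 6]
```

The two results agree because the sum of v (k . q) terms can be regrouped as (sum of v k^T) q. With a softmax in place the regrouping no longer holds exactly, which is why the published analyses work with relaxed or linearized attention.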
The implications are significant. When an LLM appears to reason through a problem — showing its work, trying alternative approaches, backtracking — it is reproducing patterns from its training distribution that include worked examples, debugging traces, and explanatory prose. The model has learned that certain surface patterns (showing steps, considering alternatives) correlate with correct answers in training data. It reproduces these patterns because they reduced prediction error, not because it understands logical reasoning.
“Reasoning” is the right word for this process only with heavy qualification. The model performs pattern matching, not logical inference. When it generalizes — applying a skill to a new domain — it is because the pattern was general enough to cover the new case in the training distribution, not because it has abstracted the underlying rule.
Why This Matters — The Architectural Truth
Here is the full argument in one chain.
An LLM is a prediction engine that produces the next token given everything that came before. It is not sentient. It does not understand. It has no goals, no beliefs, no intentions. Everything it generates traces to minimizing cross-entropy loss on next-token prediction during training.
Text enters the model as token IDs. Tokens map through learned embeddings to positions in high-dimensional space where semantic relationships are encoded geometrically. Attention computes mechanical dot products between queries, keys, and values. The transformer architecture stacks these attention computations with feed-forward networks, residual connections, and layer normalization for training stability. Depth is enabled by stability mechanisms, not by cognitive structure.
Training shapes every behavior by minimizing prediction error across billions of text sequences. Temperature and sampling introduce controlled non-determinism at inference time, while the underlying forward pass is a deterministic function of weights and input. In-context learning produces the appearance of reasoning through a process that some research characterizes as implicit gradient descent, but this is mechanical computation, not cognition.
The architectural truth: LLMs are optimization artifacts. They were shaped by gradient descent to predict next tokens. Everything they can do, they can do because doing so reduced next-token prediction error. Everything they cannot do, they cannot do because that capability did not reduce prediction error in their training distribution.
Understanding this truth has practical consequences for anyone building AI systems. When you know that an LLM is a prediction engine — not a brain, not an agent, not a system with its own intentions — you can reason clearly about where these systems succeed, where they fail, and why frameworks that respect their nature produce better results than those that misapply them.
Appendix: Key Equations
A.1 — Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The scaling factor √d_k prevents softmax saturation when the key dimensionality d_k is large, which would otherwise cause vanishing gradients during training.
A.2 — Cross-Entropy Loss for Language Modeling
L = -Σ_t log P(token_t | token_1, ..., token_{t-1})
The model computes the negative log probability of the actual next token at each position in the training sequence, summing across all positions (in practice the sum is usually averaged over tokens). Gradient descent minimizes this loss, which is equivalent to maximizing the likelihood the model assigns to the actual next tokens.
A.3 — Layer Normalization
LN(x) = γ · (x - μ) / √(σ² + ε) + β
LayerNorm normalizes activations across features, computing mean μ and variance σ² for each token independently. The learned scale γ and shift β parameters allow the network to preserve representational capacity after normalization. The ε term prevents division by zero.