An exploration of the modern 'Attention' mechanism and the rise of Large Language Models (LLMs).
How does a machine know that the word 'bank' in 'river bank' is different from 'bank account'? The secret isn't in a dictionary—it's in how the machine 'attends' to the words around it.
Before 2017, AI processed text like a conveyor belt—one word at a time. The Transformer changed this by using Self-Attention, allowing the model to look at every word in a sentence simultaneously. In this framework, every word is assigned three vectors: a Query (Q), a Key (K), and a Value (V). To determine how much 'attention' word A should pay to word B, the model calculates the dot product of A's Query and B's Key. This result is scaled and passed through a softmax function to create a probability distribution. The final output is a weighted sum of the Values (V). This allows the model to capture long-range dependencies, like connecting a pronoun at the end of a paragraph to a noun at the very beginning.
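To make the arithmetic concrete, here is a minimal NumPy sketch of that scaled dot-product attention step. The projection matrices (W_q, W_k, W_v), the toy sequence length of 4, and the embedding size of 8 are illustrative assumptions, not values from any real model.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q = X @ W_q                      # Query vector for every word
    K = X @ W_k                      # Key vector for every word
    V = X @ W_v                      # Value vector for every word
    d_k = K.shape[-1]

    scores = Q @ K.T / np.sqrt(d_k)  # dot product of Q and K, then scaled
    # Softmax turns each row of scores into a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V               # weighted sum of the Values

# Toy example: 4 "words" with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8): one context-aware vector per word
```

Because every row of the weight matrix mixes information from all four positions, each output vector already reflects the whole sentence, which is where the long-range dependencies come from.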
Quick Check
In the Self-Attention mechanism, which vector acts like a 'search term' looking for relevant information?
Answer
The Query (Q) vector.
Transformers are modular. The Encoder is designed to 'understand' input by looking at words in both directions (bidirectional). This makes it well suited to tasks like sentiment analysis or named entity recognition; BERT is the classic Encoder-only example. The Decoder, however, is autoregressive: it is designed to generate text by looking only at previous words to predict the next one. Modern Large Language Models (LLMs) like GPT-4 are primarily 'Decoder-only' architectures. They use a masking technique to ensure the model doesn't 'cheat' by looking at the future words it is supposed to be predicting during training.
When a Decoder-only model generates the sentence 'The cat sat...', it follows these steps:
1. Input: 'The' -> Predicts 'cat'.
2. Input: 'The cat' -> Predicts 'sat'.
3. The model uses a triangular mask matrix so that when calculating attention for 'cat', the score for 'sat' is set to negative infinity (-inf), effectively making it invisible once the softmax is applied.
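As a sketch of that masking step, assuming the raw attention scores have already been computed (random numbers stand in for the Q-K dot products here):

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a triangular (causal) mask to a (seq_len, seq_len) matrix of raw attention scores."""
    seq_len = scores.shape[0]
    # Upper triangle (column > row) marks future tokens: set their scores to -inf
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)

    # Row-wise softmax: a score of -inf becomes a weight of exactly 0
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# 'The cat sat' -> 3 tokens
rng = np.random.default_rng(0)
print(np.round(causal_attention_weights(rng.normal(size=(3, 3))), 2))
# Row 0 ('The') attends only to itself; row 1 ('cat') sees 'The' and 'cat';
# only row 2 ('sat') has a non-zero weight in the 'sat' column.
```

The same mask is applied during training, so every position is scored as if the model were generating the sentence strictly left to right.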
Quick Check
Why would a Decoder-only model be preferred over an Encoder-only model for writing a poem?
Answer
Because Decoders are designed for generative, one-word-at-a-time prediction (autoregression), whereas Encoders are better at analyzing existing text.
LLMs exhibit emergent properties—abilities like coding or logical reasoning that appear only when the model reaches a certain scale of parameters and data. However, they are essentially 'stochastic parrots.' They predict the most statistically likely token (a chunk of text) rather than accessing a database of facts. This leads to hallucinations, where the model generates confident but false information. Furthermore, they are limited by their context window, the maximum number of tokens they can 'remember' at one time. If a conversation exceeds this window, the model 'forgets' the earliest parts of the interaction.
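A minimal sketch of that 'forgetting' behaviour, using an invented window size of 8 tokens (real context windows run from thousands to millions of tokens):

```python
def fit_to_context(tokens, window=8):
    """Keep only the most recent `window` tokens; anything earlier is effectively forgotten."""
    return tokens[-window:]

conversation = [f"tok{i}" for i in range(12)]
print(fit_to_context(conversation))  # ['tok4', ..., 'tok11'] -- tok0..tok3 are gone
```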
Consider the prompt: 'Who is the President of Mars?'
1. The model has no 'truth' database.
2. It sees 'President of...' and finds high statistical probability for names like 'Elon Musk' or 'John Carter' in its training data (science fiction or news).
3. It outputs a name because it is the most likely linguistic sequence, even though it is factually impossible.
This is a hallucination caused by the objective of minimizing 'cross-entropy loss' during training.
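A sketch of why that training objective produces this behaviour. The four-word vocabulary and the logit values below are invented purely for illustration; a real model scores tens of thousands of tokens.

```python
import numpy as np

vocab = ["Elon", "John", "nobody", "the"]   # hypothetical tiny vocabulary
logits = np.array([2.1, 1.7, 0.3, -0.5])    # made-up scores after 'The President of Mars is'

# Softmax over the vocabulary: the same operation used for attention weights
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Generation simply emits the most likely token; nothing checks whether it is true
print(vocab[int(np.argmax(probs))])          # -> 'Elon'

# Training minimizes cross-entropy: -log(probability assigned to the token that
# actually followed in the training text). Plausibility is rewarded; truth is never measured.
target = vocab.index("Elon")                 # pretend this token followed in the data
print(round(float(-np.log(probs[target])), 3))  # the cross-entropy loss for this step
```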
What does the 'softmax' function do in the attention mechanism?
Which architecture is 'bidirectional' by nature?
True or False: Hallucinations in LLMs occur because the model's internal database has corrupted files.
Review Tomorrow
In 24 hours, try to explain the difference between a Query and a Key to a friend, and why the 'softmax' function is necessary for attention weights.
Practice Activity
Research the 'Context Window' size of GPT-4 versus Claude 3. How many pages of a book could each 'read' at once before forgetting the beginning?