Learning how AI handles sequential data like text and speech using Recurrent Neural Networks.
How does your smartphone predict the next word in your text message before you even type it? It isn't reading your mind; it's using a mathematical 'memory' to understand the sequence of your thoughts.
Computers cannot process raw text like 'Hello World.' To bridge this gap, we use Tokenization, the process of breaking text into smaller units (tokens) and mapping them to numbers. First, we build a Vocabulary of size V, a unique list of all words in our dataset. Each word is then assigned a unique integer index. However, since integers like 1 and 2 imply a mathematical relationship (2 is greater than 1), we often use One-Hot Encoding instead. In this scheme, each word is represented by a vector of length V containing all zeros except for a single '1' at the word's specific index. This ensures the model treats every word as a distinct category without unintended numerical bias.
Let's encode the sentence: 'AI is cool.'
1. Build Vocabulary: {AI, is, cool}, so V = 3.
2. Assign Indices: AI = 0, is = 1, cool = 2.
3. Create Vectors:
   - 'AI' → [1, 0, 0]
   - 'is' → [0, 1, 0]
   - 'cool' → [0, 0, 1]
Each vector has a length equal to the total number of unique words in the vocabulary.
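A minimal Python sketch of this pipeline (the helper names build_vocab and one_hot are illustrative, not a standard library API):

```python
import numpy as np

def build_vocab(text):
    """Assign each unique word an integer index in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Vector of length V: all zeros except a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

vocab = build_vocab("AI is cool")   # {'ai': 0, 'is': 1, 'cool': 2}
print(one_hot("ai", vocab))         # [1. 0. 0.]
print(one_hot("cool", vocab))       # [0. 0. 1.]
```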
Quick Check
If our vocabulary has 10,000 unique words, what is the length of the one-hot encoded vector for any single word?
Answer
10,000. Each vector must have a slot for every possible word in the vocabulary.
Standard neural networks are 'feed-forward,' meaning data moves in only one direction. Recurrent Neural Networks (RNNs) are different because they possess a Hidden State (h_t), which acts as a short-term memory. As the network processes a sequence, it passes the information from the previous step into the current step. The formula for the hidden state at time t is:

h_t = f(W_xh · x_t + W_hh · h_{t-1})

Where x_t is the current input, h_{t-1} is the previous memory, W_xh and W_hh are weight matrices, and f is an activation function like tanh. This allows the network to maintain context, such as remembering the subject of a sentence to predict the correct verb later.
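Here is a minimal Python sketch of one recurrent step under this formula (the random weights and the sizes below are assumptions purely for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """One recurrent update: mix the current input with the previous memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

vocab_size, hidden_size = 3, 4             # assumed sizes for the demo
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights

h0 = np.zeros(hidden_size)                 # initial memory: all zeros
x1 = np.array([1.0, 0.0, 0.0])             # one-hot vector for the first word
h1 = rnn_step(x1, h0, W_xh, W_hh)
print(h1)                                  # the network's 'memory' after one word
```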
Imagine predicting the next word in 'The cat sat...'
1. Step 1: Input 'The'. The RNN initializes h_0 (usually all zeros) and calculates h_1 based on 'The'.
2. Step 2: Input 'cat'. The RNN calculates h_2 using the memory of 'The' (h_1) plus the new word 'cat'.
3. Step 3: Input 'sat'. The RNN calculates h_3 using h_2.
4. Output: The network uses h_3 to predict the most likely next word, such as 'on'.
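Unrolling this loop over the whole sequence looks roughly like the sketch below (the tiny vocabulary, random weights, and hidden-to-output matrix W_hy are assumptions for illustration):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3}
V, H = len(vocab), 8                       # vocabulary size, hidden size

rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.1, size=(H, V))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))  # hidden-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(V, H))  # hidden-to-output weights

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

h = np.zeros(H)                            # h_0: empty memory
for word in ["the", "cat", "sat"]:         # steps 1-3: carry the memory forward
    x = one_hot(vocab[word], V)
    h = np.tanh(W_xh @ x + W_hh @ h)       # h_1, h_2, h_3 in turn

scores = W_hy @ h                          # step 4: score every word in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()
prediction = max(vocab, key=lambda w: probs[vocab[w]])
print(prediction)                          # untrained weights, so the word is arbitrary
```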
Quick Check
What specific component of the RNN allows it to keep track of information from previous words in a sentence?
Answer
The Hidden State (or the recurrent connection).
While RNNs are powerful, they have a major flaw: they are 'forgetful' over long distances. This is known as the Vanishing Gradient Problem. During training, we use Backpropagation Through Time (BPTT) to update weights. Because the math involves multiplying many small derivatives (gradients) together across every time step, the gradient often shrinks toward zero. If a sentence is 50 words long, the influence of the 1st word on the 50th word becomes mathematically microscopic. Consequently, the model cannot learn long-term dependencies, such as matching a plural subject at the start of a paragraph with a verb at the very end.
Consider this long-range dependency:
'The keys that I left on the kitchen counter next to the fruit bowl and the spare change were missing.'
1. To predict 'were' (plural) instead of 'was' (singular), the model must remember the word 'keys' from 15 words ago.
2. In a standard RNN, the gradient signal from 'were' must travel back through 15 layers of multiplication.
3. If each per-step gradient factor is, say, 0.5, the total signal becomes 0.5^15 ≈ 0.00003. The model 'forgets' that the subject was plural because the signal is too weak to update the weights effectively.
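To see the scale of the problem, here is a quick numeric sketch (the factors 0.9 and 0.5 are illustrative assumptions, not measured gradients):

```python
# How a product of per-step gradient factors shrinks with sequence length.
for factor in (0.9, 0.5):
    for steps in (5, 15, 50):
        print(f"factor={factor}, steps={steps}: signal ≈ {factor ** steps:.2e}")
# Even 0.9 decays to roughly 5e-03 after 50 steps; 0.5 is ~3e-05 after only 15.
```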
Why is One-Hot Encoding preferred over simple integer encoding (1, 2, 3...) for word categories?
True or false: in an RNN, the hidden state h_t is calculated using only the current input x_t.
What happens to the gradient in a standard RNN as it is backpropagated through many time steps?
Review Tomorrow
In 24 hours, try to sketch the RNN 'loop' diagram and write down the formula for the hidden state from memory.
Practice Activity
Look up 'Long Short-Term Memory (LSTM)' networks. How do they use 'gates' to solve the vanishing gradient problem we discussed today?