Understanding how multi-layer neural networks learn through the process of backpropagation and the chain rule.
How does a computer 'learn' from its mistakes? Imagine a blindfolded hiker trying to find the lowest point in a valley by only feeling the slope of the ground under their feet—this is essentially how modern neural networks learn.
A Multi-Layer Perceptron (MLP) is the foundational architecture of deep learning. It consists of an Input Layer, one or more Hidden Layers, and an Output Layer. Each layer contains neurons (nodes) connected by weights ($w$). When data flows forward, each neuron calculates a weighted sum of its inputs, adds a bias ($b$), and passes the result through an activation function like ReLU or Sigmoid. This process, called Forward Propagation, results in a prediction $\hat{y}$. However, the initial prediction is usually wrong. To improve, the network must calculate a Loss Function $L$, which measures the distance between the prediction $\hat{y}$ and the actual target $y$.
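To make the forward pass concrete, here is a minimal NumPy sketch of one prediction; the layer sizes, random weights, sigmoid activation, and squared-error loss are illustrative assumptions, not a fixed recipe.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 inputs -> 4 hidden neurons -> 1 output neuron
rng = np.random.default_rng(seed=0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden-layer weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output-layer weights and biases

x = np.array([0.5, -1.2, 3.0])  # one input example
y = np.array([1.0])             # the actual target

# Forward propagation: weighted sum + bias, then activation, layer by layer
a1 = sigmoid(W1 @ x + b1)       # hidden-layer activations
y_hat = sigmoid(W2 @ a1 + b2)   # the network's prediction

loss = np.mean((y_hat - y) ** 2)  # squared-error loss: distance from the target
print(f"prediction={y_hat[0]:.3f}, loss={loss:.3f}")
```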
Quick Check
In an MLP, what three components are combined at a neuron before being passed to an activation function?
Answer
The inputs ($x$), the weights ($w$), and a bias term ($b$): the neuron computes the weighted sum of its inputs and adds the bias before applying the activation function.
Suppose we have a simple network where the loss $L$ depends on an output $y$, and $y$ depends on a weight $w$. 1. Let $L = y^2$ and $y = 3w$. 2. To find how $L$ changes with $w$, we calculate: $\frac{dL}{dw} = \frac{dL}{dy} \cdot \frac{dy}{dw}$. 3. $\frac{dL}{dy} = 2y$ and $\frac{dy}{dw} = 3$. 4. Therefore, $\frac{dL}{dw} = 2y \cdot 3 = 6y = 18w$.
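As a sanity check on this worked example, the sketch below compares the analytic chain-rule gradient against a numerical finite-difference estimate, using the same illustrative functions $L = y^2$ and $y = 3w$.

```python
def L(w):
    y = 3 * w      # y = 3w
    return y ** 2  # L = y^2

w = 2.0
analytic = 18 * w  # chain rule: dL/dw = 2y * 3 = 6y = 18w

eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)  # central-difference estimate
print(analytic, round(numeric, 3))  # both are 36.0
```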
Imagine your gradient is $\frac{\partial L}{\partial w} = 10$ and your current weight is $w = 5$. The update rule is $w_{\text{new}} = w - \eta \cdot \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate. 1. If $\eta = 0.01$, your new weight is $5 - 0.01 \times 10 = 4.9$. A steady, controlled step. 2. If $\eta = 1.0$, your new weight is $5 - 1.0 \times 10 = -5$. You have jumped far across the valley floor, potentially missing the bottom entirely.
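The same two steps in code; the weight, gradient, and learning-rate values match the example above and are purely illustrative.

```python
# One gradient-descent step for each learning rate
w, grad = 5.0, 10.0

for lr in (0.01, 1.0):
    w_new = w - lr * grad  # update rule: w_new = w - lr * gradient
    print(f"lr={lr}: w goes from {w} to {w_new}")
# lr=0.01: w goes from 5.0 to 4.9   (steady, controlled step)
# lr=1.0:  w goes from 5.0 to -5.0  (overshoots across the valley)
```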
Quick Check
Why do we subtract the gradient from the weight instead of adding it?
Answer
We subtract because the gradient points in the direction of the steepest increase; to minimize loss, we must move in the opposite direction (downhill).
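A quick numerical illustration, assuming the simple loss $L(w) = w^2$: stepping against the gradient lowers the loss, while stepping with it raises the loss.

```python
# Minimizing L(w) = w^2; its gradient is dL/dw = 2w
w = 3.0
grad = 2 * w  # gradient at w = 3 is 6

downhill = (w - 0.1 * grad) ** 2  # subtract: (3 - 0.6)^2 = 5.76, loss falls
uphill   = (w + 0.1 * grad) ** 2  # add:      (3 + 0.6)^2 = 12.96, loss rises
print(w ** 2, downhill, uphill)   # 9.0 5.76 12.96
```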
Consider a weight $w_1$ in the first layer. The loss $L$ is calculated at the end of the second layer. To update $w_1$, you must chain through the entire network: 1. Calculate $\frac{\partial L}{\partial a_2}$, how the loss changes with the second layer's output. 2. Multiply by $\frac{\partial a_2}{\partial a_1}$, how the second layer's output changes with the first layer's output. 3. Multiply by $\frac{\partial a_1}{\partial w_1}$, how the first layer's output changes with the weight itself. 4. This sequence, shown in the sketch below, allows information about the error to reach the very beginning of the network.
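Here is a minimal sketch of that chain for a tiny two-layer network, assuming sigmoid activations and a squared-error loss; the scalar weights and input values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: one input -> one hidden unit -> one output unit
x, y = 0.5, 1.0     # input and target
w1, w2 = 0.8, -0.4  # first- and second-layer weights

# Forward pass, caching each intermediate activation for the backward pass
a1 = sigmoid(w1 * x)   # first layer's output
a2 = sigmoid(w2 * a1)  # second layer's output (the prediction)
L = (a2 - y) ** 2      # squared-error loss

# Backward pass: multiply the local derivatives from the loss back to w1
dL_da2  = 2 * (a2 - y)        # step 1: how L changes with a2
da2_da1 = a2 * (1 - a2) * w2  # step 2: how a2 changes with a1
da1_dw1 = a1 * (1 - a1) * x   # step 3: how a1 changes with w1
dL_dw1  = dL_da2 * da2_da1 * da1_dw1  # the full chain
print(f"loss={L:.4f}, dL/dw1={dL_dw1:.5f}")
```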
Review Questions
Which mathematical principle allows backpropagation to calculate gradients across multiple layers?
What happens if the learning rate is set too high?
True or False: In backpropagation, weights are updated before the loss is calculated.
Review Tomorrow
In 24 hours, try to write down the weight update formula from memory and explain what each symbol represents.
Practice Activity
Sketch a 3-layer MLP (Input, 1 Hidden, Output) and draw the arrows for both the forward pass (predictions) and the backward pass (gradients).