Understanding how agents learn to make decisions through rewards and penalties in an environment.
How does a computer program beat the world champion at Go or learn to play video games from scratch? It doesn't follow a pre-written script; it learns through the high-stakes game of trial, error, and rewards.
At its core, Reinforcement Learning (RL) is about an agent interacting with an environment. This interaction is formalized as a Markov Decision Process (MDP). In this framework, the agent observes the current State (s), performs an Action (a), and receives a Reward (r) while transitioning to a new state (s'). The 'Markov' property assumes that the future depends only on the current state, not the entire history of moves. The ultimate goal of the agent is to learn a Policy (π), a strategy or mapping that tells the agent which action to take in any given state to maximize the cumulative reward over time, often called the Return (G), typically measured as the discounted sum G = r₁ + γ·r₂ + γ²·r₃ + …, where the discount factor γ weights future rewards.
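To make the loop concrete, here is a minimal Python sketch of one episode of agent-environment interaction. The `env` and `policy` objects are hypothetical stand-ins rather than any specific library's API: `env` is assumed to expose `reset()` and `step(action)`, and `policy` maps a state to an action.

```python
# A minimal sketch of the MDP interaction loop.
# `env` and `policy` are hypothetical stand-ins, not a specific library API:
# `env.reset()` returns the initial state, `env.step(a)` returns (s', r, done).

def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()                          # observe the initial state s
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # pi(s) -> a
        state, reward, done = env.step(action)   # environment returns s' and r
        total_reward += reward                   # accumulate the reward
    return total_reward
```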
Imagine a robot on a grid trying to reach a charging station.
1. **State (s):** The robot's current coordinates on the grid, e.g., a (row, column) pair.
2. **Action (a):** Move Up, Down, Left, or Right.
3. **Reward (r):** A large positive reward (say, +10) for reaching the charger, a penalty (say, -10) for hitting a wall, and a small cost (say, -1) for every step taken (to encourage speed).
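One way to put this grid world into code is sketched below. The grid size, wall positions, and exact reward values are illustrative choices for this example, not fixed by the lesson.

```python
# An illustrative grid world for the charging-station robot.
# Grid layout and reward magnitudes are example choices, not canonical values.

CHARGER = (3, 3)                                   # goal cell
WALLS = {(1, 1), (2, 1)}                           # cells the robot cannot enter
ACTIONS = {                                        # action -> (row delta, col delta)
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}

def step(state, action, size=4):
    """Apply an action to a (row, col) state and return (next_state, reward)."""
    dr, dc = ACTIONS[action]
    candidate = (state[0] + dr, state[1] + dc)
    off_grid = not (0 <= candidate[0] < size and 0 <= candidate[1] < size)
    if off_grid or candidate in WALLS:
        return state, -10.0                        # bumped into a wall: penalty, stay put
    if candidate == CHARGER:
        return candidate, 10.0                     # reached the charger: big reward
    return candidate, -1.0                         # ordinary move: small cost to encourage speed
```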
Quick Check
In an MDP, what do we call the strategy that dictates which action an agent should take in a specific state?
Answer
The Policy (denoted by the Greek letter π).
One of the toughest challenges in RL is the Exploration vs. Exploitation trade-off. Exploitation means the agent chooses the action it knows yields the highest reward based on current data. Exploration means the agent tries a random or unknown action to see if it leads to a better long-term outcome. If an agent only exploits, it might get stuck in a 'local optimum': a decent path that isn't actually the best one. To balance this, we often use an **ε-greedy strategy**, where the agent explores with probability ε and exploits with probability 1 − ε.
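As a concrete illustration, here is a minimal ε-greedy action selector in Python. The dictionary-based Q-table and the default ε = 0.1 are illustrative choices, not prescribed by the lesson.

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action.

    q_values is assumed to be a dict mapping (state, action) -> estimated value;
    epsilon=0.1 is just an illustrative default.
    """
    if random.random() < epsilon:
        return random.choice(actions)              # explore: pick a random action
    # exploit: pick the action with the highest estimated value (default 0.0 if unseen)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))
```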
Think of choosing a place for dinner:
1. Exploitation: You go to your favorite pizza place because you know it's a reliably good experience.
2. Exploration: You try the new sushi place down the street. It could be a disappointment (bad reward) or a new favorite (better than your current best).
3. The Trade-off: If you never explore, you'll never find the best restaurant in town.
Quick Check
Why is pure exploitation risky for an AI agent?
Answer
The agent might miss out on much higher rewards because it stops searching for better paths once it finds a functional one.
Suppose an agent is in State A, takes action a, receives a reward r, and lands in State B. The Q-learning update combines four ingredients:
1. **Current estimate:** Q(A, a), the agent's existing value for that state-action pair.
2. **Best future value:** max_a' Q(B, a'), the highest value reachable from State B.
3. **Hyperparameters:** the learning rate α (how big a step to take) and the discount factor γ (how much future rewards count).
4. **New value:** Q(A, a) ← Q(A, a) + α · [r + γ · max_a' Q(B, a') − Q(A, a)].
The agent's 'knowledge' of that action just improved!
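Here is the same update as a few lines of Python. Every number below (the reward, both Q-values, α, and γ) is a made-up placeholder chosen only to show the arithmetic.

```python
# A numeric sketch of the Q-learning update; all values are illustrative.
q_current = 2.0      # current estimate Q(A, a)
reward = 5.0         # reward r received on the transition A -> B
q_best_next = 4.0    # best future value max_a' Q(B, a')
alpha = 0.1          # learning rate
gamma = 0.9          # discount factor

td_target = reward + gamma * q_best_next   # r + gamma * max_a' Q(B, a') = 8.6
td_error = td_target - q_current           # how far off the current estimate is = 6.6
q_new = q_current + alpha * td_error       # updated estimate Q(A, a) = 2.66

print(f"Q(A, a): {q_current} -> {q_new:.2f}")
```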
Which component of an MDP represents the 'strategy' the agent follows?
In the ε-greedy strategy, with what probability does the agent explore rather than exploit?
A higher discount factor (γ closer to 1) makes an agent more 'short-sighted,' focusing only on immediate rewards.
Review Tomorrow
In 24 hours, try to sketch the MDP loop (Agent -> Action -> Environment -> State/Reward) from memory and explain the difference between the learning rate (α) and the discount factor (γ).
Practice Activity
Research the 'Frozen Lake' environment in OpenAI Gym to see how Q-learning is implemented in Python code.
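If you want a head start on that activity, the sketch below runs tabular Q-learning on Frozen Lake. It assumes the Gymnasium package (the maintained successor to OpenAI Gym) and its current `reset()`/`step()` API; the hyperparameters are illustrative starting points rather than tuned values.

```python
import numpy as np
import gymnasium as gym  # maintained successor to OpenAI Gym (pip install gymnasium)

# Tabular Q-learning on FrozenLake; hyperparameters are illustrative.
env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])
        state = next_state

print(q_table)  # learned state-action values
```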