Understanding Policy Gradients: A Simple RL Example

In reinforcement learning (RL), policy gradient algorithms are a powerful way to train agents to make optimal decisions. These algorithms optimize a policy by computing gradients of expected rewards, often using the advantage formulation. In this blog, we’ll walk through a concrete example to demystify policy gradients, focusing on the gradient of the policy probability $\nabla_\theta \pi_\theta(a|s)$ and of the log-probability $\nabla_\theta \log \pi_\theta(a|s)$, and how they drive policy updates. Let’s dive into a simple RL scenario and see these concepts in action!

Who is this Article For?

  • Readers familiar with the reinforcement learning framework.
  • Readers familiar with policy gradient algorithms.
  • Readers who want to learn policy gradients from scratch through hands-on math.

The Scenario: A Robot in a 1D Grid World

Imagine a robot navigating a 1D grid with three states: $S_1$, $S_2$, and $S_3$. The goal is to reach $S_3$, the terminal state, which yields a reward of +1. The robot can take two actions in each state:


  • Left: Move one state left (e.g., $S_2 \to S_1$).
  • Right: Move one state right (e.g., $S_1 \to S_2$).

Problem Details:

  • States: $\{ S_1, S_2, S_3 \}$, where $S_3$ is terminal.
  • Rewards:
    • Reaching $S_3$: $R = +1$.
    • All other transitions: $R = 0$.
  • Discount Factor: $\gamma = 0.9$ (future rewards are discounted).
  • Policy: A parameterized policy $\pi_\theta(a|s)$ determines the probability of each action.

Our focus will be on state $S_1$, where we’ll compute the policy gradient and update the policy based on a sampled trajectory.

Steps We Will Follow to Solve this Problem:

  • 1. Defining the Policy: We’ll make a rule (policy function) that tells the robot how likely it is to move Left or Right in each spot. We’ll keep tweaking this rule to help the robot make better choices.
  • 2. Simulating a Trajectory: We’ll let the robot move from one spot to another (like $S_1$ to $S_2$ to $S_3$) and see what steps it takes and what rewards it gets.
  • 3. Computing the Return: We’ll total up the rewards the robot gets during its journey, giving more weight to rewards it gets sooner (since later rewards are less certain).
  • 4. Estimating the Advantage: We’ll figure out if the robot’s moves were better or worse than average, so we can focus on improving the best ones.
  • 5. Computing the Gradients: We’ll look at how much each move changes the robot’s decision rule, so we know how to adjust it.
  • 6. Computing the Policy Gradient: We’ll combine the “how good” and “how to improve” info to decide how to update the robot’s decision rule.
  • 7. Updating the Policy: We’ll tweak the robot’s decision rule a little to make it more likely to pick good moves next time.
  • 8. Try a Different Path: We’ll test what happens if the robot picks a different move (like Left instead of Right) to see how it changes things.

1: Defining the Policy

Let’s define the policy for state $S_1$ using a logistic (sigmoid) function parameterized by a single parameter $\theta$:

$$\pi_\theta(\text{Right} | S_1) = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}$$

$$\pi_\theta(\text{Left} | S_1) = 1 - \sigma(\theta) = \frac{e^{-\theta}}{1 + e^{-\theta}}$$

Initially, we can set $\theta = 0$:

$$\pi_\theta(\text{Right} | S_1) = \sigma(0) = \frac{1}{1 + e^{0}} = 0.5$$

$$\pi_\theta(\text{Left} | S_1) = 0.5$$

The policy is equally likely to choose Left or Right in $S_1$.
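To make this concrete, here is a minimal Python sketch of the sigmoid policy for $S_1$ (the helper name `policy` and the dictionary layout are just choices for this example, not part of the original setup):

```python
import math

def policy(theta):
    """Action probabilities in S1 under the sigmoid policy pi_theta."""
    p_right = 1.0 / (1.0 + math.exp(-theta))  # sigma(theta)
    return {"Right": p_right, "Left": 1.0 - p_right}

print(policy(0.0))  # {'Right': 0.5, 'Left': 0.5}
```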

2: Simulating a Trajectory

Let’s simulate a trajectory starting from $S_1$:

  1. In $S_1$, the robot chooses Right (probability 0.5) and moves to $S_2$. Reward: $R_1 = 0$.
  2. In $S_2$, the robot chooses Right again, moving to $S_3$. Reward: $R_2 = 1$.
  3. In $S_3$, the episode ends (terminal state).

Trajectory:

$$(S_1, \text{Right}, R_1 = 0, S_2, \text{Right}, R_2 = 1, S_3)$$
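Below is a rough sketch of how such a trajectory could be sampled in code. The `step` function encodes grid dynamics I am assuming for this sketch (Right moves toward $S_3$, Left moves toward $S_1$, and only reaching $S_3$ gives reward and ends the episode), and for simplicity it reuses the same $\theta$ in every state, even though the article only parameterizes the policy at $S_1$:

```python
import math
import random

def step(state, action):
    """Assumed grid dynamics: states are 1, 2, 3; reaching state 3 ends the episode."""
    next_state = min(state + 1, 3) if action == "Right" else max(state - 1, 1)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3

def sample_trajectory(theta, max_steps=20):
    """Roll out the sigmoid policy from S1 and record (state, action, reward) tuples."""
    state, trajectory = 1, []
    for _ in range(max_steps):
        p_right = 1.0 / (1.0 + math.exp(-theta))
        action = "Right" if random.random() < p_right else "Left"
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        if done:
            break
        state = next_state
    return trajectory

print(sample_trajectory(theta=0.0))  # e.g. [(1, 'Right', 0.0), (2, 'Right', 1.0)]
```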

3: Computing the Return

The return $G_t$ is the sum of discounted future rewards from time $t$. For the action in $S_1$ at $t = 0$:

$$G_0 = R_1 + \gamma R_2 = 0 + 0.9 \cdot 1 = 0.9$$

This return represents the total discounted reward for the trajectory.
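In code, the discounted return is a one-liner over the trajectory’s rewards (here hard-coded to $R_1 = 0$ and $R_2 = 1$ from the rollout above):

```python
gamma = 0.9
rewards = [0.0, 1.0]  # R_1, R_2 from the sampled trajectory
G0 = sum(gamma**t * r for t, r in enumerate(rewards))
print(G0)  # 0.9
```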

4: Estimating the Advantage

Policy gradient algorithms often use the advantage formulation:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^\pi(s, a) \right]$$

The advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better an action is compared to the average action in state $s$.

$Q^\pi(S_1, \text{Right})$: the expected return for taking Right in $S_1$. From our trajectory, we approximate:

$$Q^\pi(S_1, \text{Right}) \approx G_0 = 0.9$$

Assume we have a baseline estimate of the value of $S_1$:

$$V^\pi(S_1) \approx 0.5$$

Thus:

$$A^\pi(S_1, \text{Right}) = 0.9 - 0.5 = 0.4$$

A positive advantage means Right was better than average in $S_1$.
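The advantage estimate itself is just the arithmetic above, with the sampled return standing in for $Q^\pi$ and the assumed baseline for $V^\pi(S_1)$:

```python
G0 = 0.9        # sampled return, used as an estimate of Q(S1, Right)
V_S1 = 0.5      # assumed baseline value of S1 (from the text)
advantage = G0 - V_S1
print(advantage)  # 0.4 (up to floating-point rounding)
```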

5: Computing the Gradients

Now, we compute the gradients $\nabla_\theta \pi_\theta(a|s)$ and $\nabla_\theta \log \pi_\theta(a|s)$.

a) Gradient of the Policy Probability

$$\pi_\theta(\text{Right} | S_1) = \sigma(\theta)$$

The derivative of the sigmoid is:

$$\nabla_\theta \sigma(\theta) = \sigma(\theta) (1 - \sigma(\theta))$$

For $\theta = 0$:

$$\sigma(0) = 0.5$$

$$\nabla_\theta \pi_\theta(\text{Right} | S_1) = 0.5 \cdot (1 - 0.5) = 0.25$$

This gradient shows that, at $\theta = 0$, increasing $\theta$ increases the probability of Right at a rate of 0.25 per unit change in $\theta$.
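As a sanity check, we can compare the analytic sigmoid derivative against a central finite difference (the helper `sigmoid` and the step size `eps` are just choices for this sketch):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta, eps = 0.0, 1e-6
analytic = sigmoid(theta) * (1.0 - sigmoid(theta))                    # sigma'(theta) = sigma(1 - sigma)
numeric = (sigmoid(theta + eps) - sigmoid(theta - eps)) / (2.0 * eps)  # central difference
print(analytic, numeric)  # both ~0.25
```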

b) Gradient of the Log-Probability

Using the chain rule:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}$$

$$\pi_\theta(\text{Right} | S_1) = 0.5$$

$$\nabla_\theta \pi_\theta(\text{Right} | S_1) = 0.25$$

$$\nabla_\theta \log \pi_\theta(\text{Right} | S_1) = \frac{0.25}{0.5} = 0.5$$
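Numerically, this is just the ratio of the two quantities we already have; equivalently, for a sigmoid policy one can show analytically that $\nabla_\theta \log \sigma(\theta) = 1 - \sigma(\theta)$, which is also 0.5 at $\theta = 0$:

```python
pi_right = 0.5            # pi_theta(Right | S1) at theta = 0
grad_pi_right = 0.25      # gradient of pi_theta(Right | S1) at theta = 0
grad_log_pi_right = grad_pi_right / pi_right
print(grad_log_pi_right)  # 0.5, matching 1 - sigma(0) = 0.5
```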

6: Computing the Policy Gradient

For our trajectory, we approximate the policy gradient:

$$\nabla_\theta J(\pi_\theta) \approx \nabla_\theta \log \pi_\theta(\text{Right} | S_1) \cdot A^\pi(S_1, \text{Right})$$

$$\nabla_\theta J(\pi_\theta) \approx 0.5 \cdot 0.4 = 0.2$$

This positive gradient indicates we should increase $\theta$ to favor Right.
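Putting the pieces together for this single sampled step:

```python
grad_log_pi = 0.5   # gradient of log pi_theta(Right | S1) at theta = 0
advantage = 0.4     # advantage estimate from step 4
policy_gradient = grad_log_pi * advantage
print(policy_gradient)  # 0.2
```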

7: Updating the Policy

Update $\theta$ using gradient ascent:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$$

With learning rate $\alpha = 0.1$:

$$\theta = 0 + 0.1 \cdot 0.2 = 0.02$$

New policy:

$$\pi_\theta(\text{Right} | S_1) = \sigma(0.02) = \frac{1}{1 + e^{-0.02}} \approx 0.505$$

$$\pi_\theta(\text{Left} | S_1) = 1 - 0.505 = 0.495$$

The probability of Right increases slightly, reflecting its positive advantage.
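The same update in code (note that with a single-sample estimate and a small learning rate, the change in the policy is tiny, as expected):

```python
import math

theta, alpha, policy_gradient = 0.0, 0.1, 0.2
theta = theta + alpha * policy_gradient    # ~0.02 (gradient ascent step)
p_right = 1.0 / (1.0 + math.exp(-theta))   # sigma(0.02) ~ 0.505
print(theta, p_right, 1.0 - p_right)
```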

8: Exploring the Alternative Action

What if the robot chose Left in $S_1$? Suppose Left keeps the robot in $S_1$ with $R_1 = 0$, and the episode ends. Then:

  • Return: $G_0 = 0$.

  • Advantage: $A^\pi(S_1, \text{Left}) \approx 0 - 0.5 = -0.5$.

  • Gradients:

    • $\pi_\theta(\text{Left} | S_1) = 0.5$.

    • $\nabla_\theta \pi_\theta(\text{Left} | S_1) = -\sigma(\theta)(1 - \sigma(\theta)) = -0.25$.

    • $\nabla_\theta \log \pi_\theta(\text{Left} | S_1) = \frac{-0.25}{0.5} = -0.5$.

  • Policy gradient: $-0.5 \cdot (-0.5) = 0.25$.

  • Update: $\theta \leftarrow 0 + 0.1 \cdot 0.25 = 0.025$, increasing $\pi_\theta(\text{Right} | S_1)$.

This shows the policy learns to avoid Left and favor Right.
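Here is the same arithmetic for the Left branch, collected into one sketch:

```python
import math

sigma = 1.0 / (1.0 + math.exp(-0.0))                 # sigma(0) = 0.5
pi_left = 1.0 - sigma                                # pi_theta(Left | S1) = 0.5
grad_pi_left = -sigma * (1.0 - sigma)                # d/dtheta (1 - sigma(theta)) = -0.25
grad_log_pi_left = grad_pi_left / pi_left            # -0.5
advantage_left = 0.0 - 0.5                           # G_0 - V(S1) = -0.5
policy_gradient = grad_log_pi_left * advantage_left  # (-0.5) * (-0.5) = 0.25
theta_new = 0.0 + 0.1 * policy_gradient              # ~0.025 -> pi(Right | S1) increases
print(grad_log_pi_left, policy_gradient, theta_new)
```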

Why Use $\nabla_\theta \log \pi_\theta(a \mid s)$?

You might wonder why we use $\nabla_\theta \log \pi_\theta(a|s)$ instead of $\nabla_\theta \pi_\theta(a|s)$. The policy gradient theorem naturally yields the gradient in the form:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \left[ \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot A^\pi(s, a) \right]$$

Since $\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \nabla_\theta \log \pi_\theta(a|s)$, the two forms are equivalent. Writing the gradient with the log-probability keeps it as an expectation under $\pi_\theta$, so we can estimate it simply by sampling actions from the policy.

In practice, the log-probability form is also numerically convenient to compute, and it pairs naturally with variance-reduction tricks such as subtracting a baseline, which is exactly how the advantage arises.
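As a quick numerical check of the identity $\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \nabla_\theta \log \pi_\theta(a|s)$, we can compare a finite-difference gradient of the log-probability against the ratio form (again a toy sketch for our sigmoid policy, not how real libraries compute it; frameworks use automatic differentiation):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta, eps = 0.0, 1e-6
grad_log_fd = (math.log(sigmoid(theta + eps)) - math.log(sigmoid(theta - eps))) / (2.0 * eps)
ratio_form = (sigmoid(theta) * (1.0 - sigmoid(theta))) / sigmoid(theta)
print(grad_log_fd, ratio_form)  # both ~0.5
```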

Conclusion

This simple example shows how policy gradients work in practice. In real-world RL, we’d use multiple trajectories, neural network policies, and advanced algorithms like PPO or TRPO, but the core ideas remain the same.