Reinforcement Learning with Code [Chapter 4. Value Iteration and Policy Iteration]
This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as ZhaoShiyu's Mathematical Foundation of Reinforcement Learning.
Chapter 4. Value Iteration and Policy Iteration
Value iteration and policy iteration are collectively known as dynamic programming. Dynamic programming is a model-based approach and is the simplest family of RL algorithms. It is helpful for understanding model-free algorithms.
4.1 Value iteration
Value iteration solves the Bellman optimality equation directly.
Matrix-vector form:
Value iteration is exactly the algorithm suggested by the contraction mapping theorem in Chapter 3. Each iteration consists of two steps. The first step is the policy update:
$$
\pi_{k+1} = \arg \textcolor{red}{\max_{\pi}(r_{\pi} + \gamma P_\pi v_k)}
$$
The second step is the value update. Mathematically, it substitutes $\pi_{k+1}$ into the following operation:
$$
v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k
$$
The above algorithm is in matrix-vector form, which is useful for understanding the core idea. Note that the value update is a one-step calculation: $v_{k+1}$ is obtained from $v_k$ in a single iteration rather than by solving an equation.
A common misunderstanding should be avoided: we do not use the Bellman equation of $\pi_{k+1}$ to calculate $v_{k+1}$, so $v_{k+1}$ is not a true state value. Instead, with the greedy policy update, $v_{k+1}(s)$ is simply the maximum action value $\max_a q_k(s,a)$.
Value iteration thus contains the step of solving the Bellman optimality equation, shown in red above.
Elementwise form:
First, the elementwise form of the policy update is
$$
\begin{aligned}
\pi_{k+1} & = \arg \max_{\pi}(r_{\pi} + \gamma P_\pi v_k) \quad \text{(matrix-vector form)}\newline
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s)
\Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_k(s^\prime) \Big) \quad \text{(elementwise form)}\newline
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s) q_k(s,a)
\end{aligned}
$$
The maximization is achieved by the greedy policy update: $\pi_{k+1}(a|s) = 1$ when $a = a_k^*(s)$ and $\pi_{k+1}(a|s) = 0$ when $a \ne a_k^*(s)$, where $a_k^*(s) = \arg\max_a q_k(s,a)$.
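As a small illustration, the greedy policy update over a table of action values could be sketched as follows. This is a minimal sketch, assuming `q` is a NumPy array of shape `(num_states, num_actions)`; the names are not from the book.

```python
import numpy as np

def greedy_policy_update(q):
    """Return the greedy deterministic policy for an action-value table.

    q is assumed to hold q_k(s, a) with shape (num_states, num_actions).
    The returned policy satisfies policy[s, a] = 1 if a = argmax_a q_k(s, a), else 0.
    """
    num_states, num_actions = q.shape
    policy = np.zeros((num_states, num_actions))
    best_actions = np.argmax(q, axis=1)                 # a_k^*(s) for every state s
    policy[np.arange(num_states), best_actions] = 1.0   # pi_{k+1}(a_k^*(s) | s) = 1
    return policy
```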
Second, the elementwise form of the value update is
$$
\begin{aligned}
v_{k+1} & = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}}v_k \quad \text{(matrix-vector form)}\newline
\to v_{k+1}(s) & = \sum_a \pi_{k+1}(a|s)
\underbrace{
\Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime}p(s^\prime|s,a)v_k(s^\prime) \Big)}_{q_k(s,a)} \quad \text{(elementwise form)}
\end{aligned}
$$
Since $\pi_{k+1}$ is a greedy policy, the above equation simplifies to
$$
v_{k+1}(s) = \max_a q_k(s,a)
$$
Pseudocode:
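In place of the pseudocode figure, here is a minimal Python sketch of elementwise value iteration for a finite MDP. The tensors `P` and `R` and their shapes are assumptions for illustration, not notation from the book.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration for a finite MDP with a known model.

    P: array of shape (S, A, S), P[s, a, s'] = p(s' | s, a).
    R: array of shape (S, A),    R[s, a]     = sum_r p(r | s, a) * r.
    Returns the converged values v and a greedy policy as action indices.
    """
    num_states, num_actions, _ = P.shape
    v = np.zeros(num_states)
    while True:
        # q_k(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
        q = R + gamma * P @ v          # shape (S, A)
        v_new = q.max(axis=1)          # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)          # policy update: greedy w.r.t. q
    return v, policy
```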

4.2 Policy iteration
Policy iteration is not an algorithm that directly solves the Bellman optimality equation; however, it has an intimate relationship with value iteration. Each policy iteration consists of two parts: policy evaluation and policy improvement.
Policy evaluation:
The first step is policy evaluation. Mathematically, it solves the Bellman equation of $\pi_k$:
$$
v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}
$$
This is the matrix-vector form of the Bellman equation, where $r_{\pi_k}$ and $P_{\pi_k}$ are known. Here, $v_{\pi_k}$ is the state value to be solved.
How to calculate the state value $v_{\pi_k}$ is important. In Section 2.4 we already introduced how to solve the Bellman equation iteratively to obtain the state value $v_{\pi_k}$:
$$
\textcolor{blue}{v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k}v_{\pi_k}^{(j)}}, \quad j=0,1,2,\dots
$$
It is clear that policy iteration calculates the state value by running the above iteration to convergence, which in principle is an infinite-step calculation, in contrast to the one-step value update of value iteration.
Policy improvement:
The second step is to improve the policy. How do we do that? Once $v_{\pi_k}$ is calculated in the first step, a new and improved policy can be obtained as
$$
\pi_{k+1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})
$$
Elementwise form:
The policy evaluation step solves for $v_{\pi_k}$ from the Bellman equation iteratively:
$$
\begin{aligned}
v_{\pi_k}^{(j+1)} & = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)} \quad \text{(matrix-vector form)} \newline
\to v_{\pi_k}^{(j+1)}(s) & = \sum_a \pi_k(a|s) \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_{\pi_k}^{(j)} (s^\prime) \Big) \quad \text{(elementwise form)}
\end{aligned}
$$
where $j=0,1,2,\dots$
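Under the same assumed tensors `P` and `R` as in the value-iteration sketch above, the iterative policy evaluation step could be written as the following sketch (not the book's code):

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma=0.9, tol=1e-6):
    """Iteratively solve v_{pi_k} = r_{pi_k} + gamma * P_{pi_k} v_{pi_k}.

    policy: array of shape (S, A), policy[s, a] = pi_k(a | s).
    P:      array of shape (S, A, S), P[s, a, s'] = p(s' | s, a).
    R:      array of shape (S, A), expected immediate rewards.
    """
    num_states = P.shape[0]
    v = np.zeros(num_states)
    while True:
        # action values under the current estimate v^{(j)}
        q = R + gamma * P @ v
        # v^{(j+1)}(s) = sum_a pi_k(a|s) q(s, a)
        v_new = (policy * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```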
The policy improvement step is to solve $\pi_{k+1}=\arg \max_\pi (r_\pi + \gamma P_{\pi}v_{\pi_k})$.
$$
\begin{aligned}
\pi_{k+1} & =\arg \max_{\pi} (r_{\pi} + \gamma P_{\pi}v_{\pi_k}) \quad \text{(matrix-vector form)} \newline
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s)
\Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_{\pi_k} (s^\prime) \Big)
\quad \text{(elementwise form)} \newline
\to \pi_{k+1}(s) & = \arg \max_{\pi} \sum_a \pi(a|s) q_{\pi_k}(s,a)
\end{aligned}
$$
Let $a_k^*(s) = \arg \max_a q_{\pi_k}(s,a)$. The greedy policy achieving the maximization is $\pi_{k+1}(a|s) = 1$ when $a = a_k^*(s)$ and $\pi_{k+1}(a|s) = 0$ when $a \ne a_k^*(s)$.
Pseudocode:
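In place of the pseudocode figure, policy iteration can be sketched by combining the two sketches above. This reuses the `policy_evaluation` function defined earlier; `P`, `R`, and their shapes remain assumptions for illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement.

    P: array of shape (S, A, S), R: array of shape (S, A);
    policy_evaluation is the iterative solver sketched above.
    """
    num_states, num_actions, _ = P.shape
    # Start from a uniform random policy pi_0.
    policy = np.full((num_states, num_actions), 1.0 / num_actions)
    while True:
        # Policy evaluation: solve the Bellman equation of pi_k iteratively.
        v = policy_evaluation(policy, P, R, gamma, tol)
        # Policy improvement: pi_{k+1} is greedy with respect to q_{pi_k}.
        q = R + gamma * P @ v
        new_policy = np.zeros_like(policy)
        new_policy[np.arange(num_states), q.argmax(axis=1)] = 1.0
        if np.array_equal(new_policy, policy):
            return v, new_policy
        policy = new_policy
```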
