Reinforcement Learning with Code [Chapter 3. Optimal State Value and Bellman Optimality Equation]
This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as ZhaoShiyu's Mathematical Foundation of Reinforcement Learning.
Chapter 3. Optimal State Value and Bellman Optimality Equation
3.1 How to define optimality
One core idea is to use the action value to judge how good an action is: if we update the policy to select the action with the greatest action value, we can obtain a better policy.
(Optimal policy and optimal state value). A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \ge v_\pi(s)$ for all $s\in\mathcal{S}$ and for any other policy $\pi$. The state values of $\pi^*$ are the optimal state values.
Why does this definition work? One intuitive explanation is that the state value $v_\pi(s)$ denotes the expected return along trajectories that follow policy $\pi$. If the policy $\pi^*$ yields the greatest expected return from every state, we can reasonably regard $\pi^*$ as optimal.
3.2 Bellman optimality equation (BOE)
The Bellman optimality equation (BOE) is
$$
\begin{aligned}
v(s) & = \max_\pi \sum_a \pi(a|s) \Big[ \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^{\prime}|s,a) v(s^{\prime}) \Big] \newline
\textcolor{red}{v(s)} & \textcolor{red}{= \max_\pi \sum_a \pi(a|s) q(s,a)}
\end{aligned}
$$
One issue is that the BOE contains two unknowns: the state value $v(s)$ (equivalently, the action value $q(s,a)$ on the right-hand side) and the policy $\pi(a|s)$. How can we solve the BOE?
The idea is to fix one unknown and solve the resulting maximization problem for the other. For the Bellman optimality equation, we first treat the action values $q(s,a)$ (that is, $v(s^\prime)$ on the right-hand side) as fixed, and then maximize over the policy $\pi(a|s)$ for all $a\in\mathcal{A}(s)$.
Before the analysis, consider a simple example first. Suppose $x_1,x_2,x_3\in\mathbb{R}$ are given. Find $c_1^*,c_2^*,c_3^*$ solving
$$
\max_{c_1,c_2,c_3} c_1x_1+c_2x_2+c_3x_3 \newline
\text{subject to } c_1+c_2+c_3=1, \quad c_1,c_2,c_3 \ge 0
$$
Without loss of generality, suppose $x_3 \ge x_1$ and $x_3 \ge x_2$. Then, the optimal solution is $c_3^*=1$ and $c_1^*=c_2^*=0$. This is because, for any feasible $c_1,c_2,c_3$,
$$
x_3 = (c_1+c_2+c_3)x_3 = c_1x_3 + c_2x_3 + c_3x_3 \ge c_1x_1 +c_2x_2+c_3x_3
$$
Hence, inspired by the above example and noting that $\sum_a \pi(a|s)=1$ and $\pi(a|s)\ge 0$, we have
$$
\begin{aligned}
v(s) & = \max_\pi \sum_a \pi(a|s) q(s,a) \newline
& \le \max_\pi \sum_a \pi(a|s) \max_{a\in\mathcal{A}(s)} q(s,a) \newline
& = \max_{a\in\mathcal{A}(s)} q(s,a)
\end{aligned}
$$
where the last step holds because $\sum_a \pi(a|s)=1$. The upper bound is attained, and hence the maximum over $\pi$ is achieved, by the deterministic policy with $\pi(a^*|s)=1$ and $\pi(a|s)=0$ for $a\ne a^*$, where
$$
a^* = \text{arg} \max_a q(s,a)
$$
Such a policy is often called a greedy policy.
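As a quick illustration, here is a minimal NumPy sketch (the action values `q_s` are made-up numbers, not from any particular MDP) that builds the greedy policy for a single state and checks that it attains $\max_a q(s,a)$:

```python
import numpy as np

# Hypothetical action values q(s, .) for a single state s with three actions.
q_s = np.array([1.0, 3.5, 2.0])

# Greedy policy: put all probability on the action with the largest action value.
a_star = np.argmax(q_s)            # a* = arg max_a q(s, a)
pi_s = np.zeros_like(q_s)
pi_s[a_star] = 1.0                 # pi(a*|s) = 1, pi(a|s) = 0 otherwise

# The resulting value equals max_a q(s, a), as derived above.
assert np.isclose(pi_s @ q_s, q_s.max())
print(a_star, pi_s)                # 1 [0. 1. 0.]
```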
3.3 Matrix-vector form of Bellman optimality equation
Referring to the matrix-vector form of the Bellman equation, it is straightforward to write the matrix-vector form of the Bellman optimality equation as follows:
$$
\textcolor{blue} {v = \max_\pi (r_\pi +\gamma P_\pi v)}
$$
where $v\in\mathbb{R}^{|\mathcal{S}|}$. The structures of $r_\pi$ and $P_\pi$ are the same as those in the matrix-vector form of the Bellman equation:
$$
[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r
$$
$$
[P_\pi]_{s, s^\prime} \triangleq \sum_a \pi(a|s) p(s^\prime|s,a)
$$
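To make the matrix-vector form concrete, the sketch below (using a hypothetical tabular model: `p_s[s, a, s']` stands for $p(s^\prime|s,a)$ and `r_sa[s, a]` for the expected reward $\sum_r p(r|s,a)r$, with an arbitrary policy `pi`) assembles $r_\pi$ and $P_\pi$ as NumPy arrays:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical model: p_s[s, a, s'] = p(s'|s,a), r_sa[s, a] = sum_r p(r|s,a) * r
p_s = rng.random((n_states, n_actions, n_states))
p_s /= p_s.sum(axis=2, keepdims=True)       # make each p(.|s,a) a distribution
r_sa = rng.random((n_states, n_actions))

# An arbitrary stochastic policy: pi[s, a] = pi(a|s) (uniform here)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# [r_pi]_s = sum_a pi(a|s) r(s,a)
r_pi = (pi * r_sa).sum(axis=1)
# [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum('sa,sat->st', pi, p_s)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # each row of P_pi sums to 1
print(r_pi.shape, P_pi.shape)               # (3,) (3, 3)
```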
Furthermore, denote the right hand side as
$$
f(v) \triangleq \max_\pi(r_\pi + \gamma P_\pi v)
$$
Hence, the Bellman optimality equation can be written as
$$
v=f(v) \newline
f(v) - v = 0
$$
This turns solving the Bellman optimality equation into finding a root of $\textcolor{blue}{f(v) - v = 0}$, i.e., a fixed point of $f$.
3.4 Contraction mapping theorem
Fixed point: consider a function $f(x)$ where $x\in\mathbb{R}^b$ and $f:\mathbb{R}^b\to\mathbb{R}^b$. A point $x^*$ is called a fixed point if
$$
f(x^*) = x^*
$$
The function $f$ is called a contraction mapping if there exists $\gamma\in(0,1)$ such that
$$
||f(x_1)-f(x_2)|| \le \gamma ||x_1 - x_2||
$$
for any $x_1, x_2\in \mathbb{R}^b$.
(Contraction mapping theorem) For any equation that has the form of $x=f(x)$ where $x$ and $f(x)$ are real vectors, if $f$ is a contraction mapping, then
Existence: There exists a fixed point $x^*$ satisfying $f(x^*)=x^*$.
Uniqueness: The fixed point $x^*$ is unique.
Algorithm: Consider the iterative process:
$$
x_{k+1} = f(x_k)
$$
where $k=0,1,2,\dots$. Then, $x_k\to x^*$ as $k\to \infty$ for any initial guess $x_0$. Moreover, the convergence rate is exponentially fast.
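As a tiny numerical illustration of this theorem, consider the scalar contraction $f(x) = 0.5x + 1$ (an arbitrary example, with contraction factor $0.5$ and unique fixed point $x^*=2$):

```python
# Fixed-point iteration x_{k+1} = f(x_k) for the contraction f(x) = 0.5*x + 1,
# whose unique fixed point is x* = 2.
def f(x):
    return 0.5 * x + 1.0

x = 10.0                          # arbitrary initial guess x_0
for k in range(10):
    x = f(x)
    # the error |x_k - x*| shrinks by the contraction factor 0.5 at every step
    print(k, x, abs(x - 2.0))
```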
3.5 Solution of BOE
(Contraction property of BOE). The function $f(v) \triangleq \max_\pi(r_\pi + \gamma P_\pi v)$ in the BOE is a contraction mapping satisfying
$$
||f(v_1)-f(v_2)|| \le \gamma ||v_1 - v_2||
$$
where $v_1,v_2\in\mathbb{R}^{|\mathcal{S}|}$ are any two vectors and $\gamma\in(0,1)$ is the discount rate.
The proof is omitted.
Applying the contraction mapping theorem to the BOE, we have the following theorem:
(Existence, Uniqueness, Algorithm). For the BOE $v=f(v) \triangleq \max_\pi(r_\pi + \gamma P_\pi v)$, there always exists a unique solution $v^*$, which can be solved iteratively by
$$
v_{k+1} = f(v_k) = \max_\pi(r_\pi + \gamma P_\pi v_k), \quad k=0,1,\dots
$$
The sequence $\{v_k\}$ converges to the optimal solution $v^*$ of the BOE exponentially fast for any initial guess $v_0$.
This algorithm is actually called value iteration.
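Below is a minimal value-iteration sketch under the same hypothetical tabular model as before (`p_s` for $p(s^\prime|s,a)$ and `r_sa` for the expected reward; these arrays are illustrative assumptions, not from the book). It iterates $v_{k+1}(s) = \max_a \big[ r(s,a) + \gamma \sum_{s^\prime} p(s^\prime|s,a) v_k(s^\prime) \big]$ until the update becomes negligible:

```python
import numpy as np

def value_iteration(p_s, r_sa, gamma=0.9, tol=1e-8, max_iter=1000):
    """Iterate v_{k+1} = f(v_k) = max_pi (r_pi + gamma * P_pi v_k), elementwise."""
    v = np.zeros(r_sa.shape[0])                # any initial guess v_0 works
    for _ in range(max_iter):
        q = r_sa + gamma * (p_s @ v)           # q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
        v_new = q.max(axis=1)                  # v_{k+1}(s) = max_a q(s,a)
        if np.max(np.abs(v_new - v)) < tol:    # stop once the iteration has converged
            return v_new
        v = v_new
    return v

# Hypothetical 3-state, 2-action model (same construction as the earlier sketch).
rng = np.random.default_rng(0)
p_s = rng.random((3, 2, 3))
p_s /= p_s.sum(axis=2, keepdims=True)
r_sa = rng.random((3, 2))

v_star = value_iteration(p_s, r_sa)
print(v_star)                                  # optimal state values v*
```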
First, following the algorithm above gives the optimal state value $v^*$. Second, we solve for the unknown policy $\pi$ in the BOE. Suppose $v^*$ has been obtained and
$$
\pi^* = \text{arg}\max_\pi(r_\pi+\gamma P_\pi v^*)
$$
Then, $v^*$ and $\pi^*$ satisfy
$$
v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*
$$
(Greedy optimal policy). For any $s\in\mathcal{S}$, the deterministic greedy policy
$$
\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) \newline 0, & a \ne a^*(s) \end{cases}
$$
is an optimal policy solving the BOE. Here
$$
a^*(s) = \text{arg} \max_a q^*(s,a)
$$
where
$$
q^*(s,a) \triangleq \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^{\prime}|s,a) v^*(s^{\prime})
$$
The matrix-vector form of the optimal policy is $\pi^*=\text{arg}\max_\pi(r_\pi+\gamma P_\pi v^*)$. Its elementwise form is
$$
\begin{aligned}
\pi^* & = \text{arg}\max_\pi(r_\pi+\gamma P_\pi v^*) \newline
\to \pi^*(s) & = \text{arg}\max_\pi \sum_a \pi(a|s) \Big( \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^{\prime}|s,a) v^*(s^{\prime}) \Big) \newline
\to \pi^*(s) & = \text{arg}\max_\pi \sum_a \pi(a|s) q^*(s,a), \quad \text{for } s\in\mathcal{S}
\end{aligned}
$$
It is clear that $\sum_a \pi(a|s)q^*(s,a)$ is maximized when $\pi(\cdot|s)$ puts all probability on the action with the greatest value of $q^*(s,a)$, i.e., the greedy policy above.
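Once $v^*$ is available, the optimal greedy policy can be read off from $q^*(s,a)$. Here is a small sketch of that extraction step (the helper `greedy_policy` and the arrays `p_s`, `r_sa`, `v_star` from the value-iteration sketch above are illustrative assumptions):

```python
import numpy as np

def greedy_policy(p_s, r_sa, v_star, gamma=0.9):
    """Deterministic greedy policy pi*(a|s) extracted from the optimal state values v*."""
    q_star = r_sa + gamma * (p_s @ v_star)         # q*(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) v*(s')
    a_star = q_star.argmax(axis=1)                 # a*(s) = arg max_a q*(s,a)
    pi_star = np.zeros_like(q_star)
    pi_star[np.arange(len(a_star)), a_star] = 1.0  # pi*(a*(s)|s) = 1, and 0 otherwise
    return pi_star

# Example usage with the model and v_star from the value-iteration sketch:
# pi_star = greedy_policy(p_s, r_sa, v_star)
```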