Reinforcement Learning with Code.

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.

Chapter 2. State Value and Bellman Equation

2.1 State value

  • State value is defined as the mean of all possible returns starting from a state; in other words, it is the expected return obtained from a specific state.

  • The mathematical definition is as follows:

    Note that the capital letters denote random variables, such as $S_t, S_{t+1}, A_t, R_{t+1}$. In particular, $S_t,S_{t+1}\in\mathcal{S}$, $A_t\in\mathcal{A}(S_t)$, and $R_{t+1}\in\mathcal{R}(S_t,A_t)$. $G_t$ denotes the return as a random variable.

    Starting from time $t$, we can obtain a state-action-reward trajectory by following policy $\pi$:

    $$
    S_t \xrightarrow{A_t} S_{t+1},R_{t+1} \xrightarrow{A_{t+1}} S_{t+2},R_{t+2} \xrightarrow{A_{t+2}} S_{t+3},R_{t+3} \cdots
    $$

    The discounted return along the trajectory is

    $$
    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    $$

    The state value is defined as:

    $$
    \textcolor{blue}{v_\pi(s)\triangleq \mathbb{E}[G_t|S_t=s]}
    $$

    which means the expected return obtained by starting from state $s$ and following the trajectory generated by policy $\pi$.

    $v_\pi(s)$ is also called the state-value function. A small numerical sketch of this definition is given below.
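To make this definition concrete, here is a minimal Monte Carlo sketch: the two-state MDP, its transition probabilities, and its reward rule are made up for illustration (not taken from the referenced book), and sampled returns are averaged to estimate $v_\pi(s)$.

```python
import numpy as np

# A made-up two-state MDP under a fixed policy (purely for illustration):
# from state 0 the agent moves to state 1 w.p. 0.7 (reward +1) or stays (reward 0);
# from state 1 it moves to state 0 w.p. 0.7 or stays (reward 0 either way).
P = np.array([[0.3, 0.7],
              [0.7, 0.3]])      # P[s, s'] : transition probabilities under the policy
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        s_next = rng.choice(2, p=P[s])
        reward = 1.0 if (s == 0 and s_next == 1) else 0.0
        g += discount * reward
        discount *= gamma
        s = s_next
    return g

# Monte Carlo estimate of v_pi(0): the mean of many sampled returns G_t with S_t = 0
v0_estimate = np.mean([sample_return(0) for _ in range(5000)])
print(v0_estimate)
```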

2.2 Bellman Equation

  • The Bellman equation is a set of linear equations describing the relationship among the values of all the states.

    For example, consider four states $s_1,s_2,s_3,s_4$ that form a cycle $s_1\to s_2\to s_3\to s_4\to s_1$, where leaving state $s_i$ yields reward $r_i$.

Let $v_i$ denote the return obtained starting from $s_i$. The returns starting from the four states can be calculated respectively as

$$
v_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \newline
v_2 = r_2 + \gamma r_3 + \gamma^2 r_4 + \cdots \newline
v_3 = r_3 + \gamma r_4 + \gamma^2 r_1 + \cdots \newline
v_4 = r_4 + \gamma r_1 + \gamma^2 r_2 + \cdots
$$

Using the idea of bootstrapping, we can rewrite these as

$$
v_1 = r_1 + \gamma (r_2 + \gamma r_3) + \cdots = r_1 + \gamma v_2 \newline
v_2 = r_2 + \gamma (r_3 + \gamma r_4) + \cdots = r_2 + \gamma v_3 \newline
v_3 = r_3 + \gamma (r_4 + \gamma r_1) + \cdots = r_3 + \gamma v_4 \newline
v_4 = r_4 + \gamma (r_1 + \gamma r_2) + \cdots = r_4 + \gamma v_1
$$

Then we rewrite these equations in matrix-vector form. Define

$$
\textbf{v} = \begin{bmatrix} v_1 \newline v_2 \newline v_3 \newline v_4 \end{bmatrix} , \qquad \textbf{r} = \begin{bmatrix} r_1 \newline r_2 \newline r_3 \newline r_4 \end{bmatrix}
$$

$$
\textbf{P} = \begin{bmatrix} 0 & 1 & 0 & 0 \newline 0 & 0 & 1 & 0 \newline 0 & 0 & 0 & 1 \newline 1 & 0 & 0 & 0 \end{bmatrix}
$$

The equation $\textbf{v}=\textbf{r}+\gamma\textbf{P}\textbf{v}$ is called the Bellman equation for this example. A quick numerical check is sketched below.
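This small example can be checked numerically. The sketch below assumes some arbitrary reward values (chosen only for illustration) and solves $\textbf{v}=\textbf{r}+\gamma\textbf{P}\textbf{v}$ with NumPy.

```python
import numpy as np

# Numerical check of the four-state cycle; the reward values are arbitrary choices.
gamma = 0.9
r = np.array([0.0, 1.0, 1.0, -1.0])           # r = [r_1, r_2, r_3, r_4]
P = np.array([[0, 1, 0, 0],                   # s_1 -> s_2
              [0, 0, 1, 0],                   # s_2 -> s_3
              [0, 0, 0, 1],                   # s_3 -> s_4
              [1, 0, 0, 0]], dtype=float)     # s_4 -> s_1

# Solve v = r + gamma * P v, i.e. (I - gamma * P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)                                      # v = [v_1, v_2, v_3, v_4]
print(np.allclose(v, r + gamma * P @ v))      # True: v satisfies the Bellman equation
```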

  • Then we can derive the Bellman equation from scratch as follows:

Note that the discounted return random variable $G_t$ can be rewritten as

$$
\begin{aligned}
G_t & = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \newline
& = R_{t+1} + \gamma(R_{t+2}+\gamma R_{t+3} + \cdots) \newline
& = R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$

where $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \cdots$ . This equation establishes the relationship between $G_t$ and $G_{t+1}$. Then the state value can be rewritten as

$$
\begin{aligned}
v_\pi(s) & = \mathbb{E}[G_t|S_t=s] \newline
& = \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t=s] \newline
& = \mathbb{E}[R_{t+1}|S_t=s] + \gamma \mathbb{E}[G_{t+1}|S_t=s]
\quad \text{(linear property of expectation)}
\end{aligned}
$$

The mean of the immediate reward can be written as (using the law of total expectation reviewed in Section 2.3)

$$
\begin{aligned}
\mathbb{E}[R_{t+1}|S_t=s] & = \sum_a \pi(a|s) \mathbb{E}[R_{t+1} | S_t=s,A_t=a] \quad \text{(conditional expectation)} \newline
& = \sum_a \pi(a|s) \sum_r p(r|s,a)r
\end{aligned}
$$

The mean of the future reward can be written as

$$
\begin{aligned}
\gamma \mathbb{E}[G_{t+1}|S_t=s] & = \gamma \sum_a \pi(a|s) \mathbb{E}[G_{t+1}|S_t=s, A_t=a] \quad \text{(conditional expectation)} \newline
& = \gamma \sum_a \pi(a|s) \sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \newline
\end{aligned}
$$

where the second equality uses the Markov property: $\mathbb{E}[G_{t+1}|S_t=s,A_t=a]=\sum_{s^\prime}p(s^\prime|s,a)\mathbb{E}[G_{t+1}|S_{t+1}=s^\prime]=\sum_{s^\prime}p(s^\prime|s,a)v_\pi(s^\prime)$. Then the state value can be rewritten in Bellman equation (BE) form

$$
\begin{aligned}
v_\pi(s) & = \mathbb{E}[G_t|S_t=s] \newline
& = \mathbb{E}[R_{t+1}|S_t=s] + \gamma \mathbb{E}[G_{t+1}|S_t=s] \newline
& =
\sum_a \pi(a|s) \sum_r p(r|s,a)r

+
\gamma \sum_a \pi(a|s) \sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \newline
& = \sum_a \pi(a|s) \Big[\sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \Big], \quad \text{for all } s\in\mathcal{S}
\end{aligned}
$$
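As a quick check, for the deterministic four-state cycle considered earlier (the policy takes a single action with probability 1, the reward from $s_1$ is $r_1$ with probability 1, and the transition goes to $s_2$ with probability 1), this general form reduces to the earlier equation:

$$
v_\pi(s_1) = \sum_a \pi(a|s_1) \Big[\sum_r p(r|s_1,a)r + \gamma\sum_{s^\prime} p(s^\prime|s_1,a) v_\pi(s^\prime) \Big] = r_1 + \gamma v_\pi(s_2)
$$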

Next, we introduce the matrix-vector form of the Bellman equation in terms of state values. Let $v_\pi=[v_\pi(s_1),v_\pi(s_2),\cdots]^T\in\mathbb{R}^n$, $r_\pi=[r_\pi(s_1),r_\pi(s_2),\cdots]^T\in\mathbb{R}^n$, and $P_\pi\in\mathbb{R}^{n\times n}$, where $n$ is the number of states. Then the Bellman equation can be written compactly as

$$
\textcolor{red}{v_\pi = r_\pi + \gamma P_\pi v_\pi} \quad \text{(matrix-vector form)}
$$

where

$$
[r_\pi]s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r,
[P_\pi]
{s,s^\prime} = \sum_a \pi(a|s) \sum_{s^\prime} p(s^\prime|s,a)
$$
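As a rough illustration of these definitions, the sketch below builds $r_\pi$ and $P_\pi$ from placeholder model arrays; the policy, reward values, and transition probabilities are hypothetical, chosen only so the code runs.

```python
import numpy as np

# Placeholder model of a small MDP (all numbers below are hypothetical):
# pi[s, a] = pi(a|s), p_r[s, a, k] = p(r_k|s, a), p_s[s, a, s'] = p(s'|s, a)
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

pi = np.full((n_states, n_actions), 0.5)                            # uniform policy
rewards = np.array([0.0, 1.0])                                      # possible reward values
p_r = np.full((n_states, n_actions, len(rewards)), 0.5)             # reward distribution
p_s = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition model

# [r_pi]_s = sum_a pi(a|s) sum_r p(r|s,a) r
r_pi = np.einsum('sa,sak,k->s', pi, p_r, rewards)
# [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum('sa,sax->sx', pi, p_s)

print(r_pi)                  # one entry per state
print(P_pi.sum(axis=1))      # each row of P_pi sums to 1
```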

2.3 Review of conditional expectation

  • The definition of conditional expectation is

$$
\mathbb{E}[X|A=a] = \sum_{x}x p(x|a)
$$

Similar to the law of total probability, we have the law of total expectation:

$$
\mathbb{E}[X] = \sum_a p(a) \mathbb{E}[X|A=a]
$$

Proof

Starting from the right-hand side and using the definition of conditional expectation,

$$
\begin{aligned}
\sum_a p(a) \mathbb{E}[X|A=a] & = \sum_a p(a) \sum_x x p(x|a) \newline
& = \sum_x x \sum_a p(x|a) p(a) \newline
& = \sum_x x p(x) \quad \text{(law of total probability)} \newline
& = \mathbb{E}[X]
\end{aligned}
$$
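A quick numerical sanity check of this identity, using a made-up joint distribution over $(A, X)$:

```python
import numpy as np

# A made-up joint distribution over (A, X), just to check the identity numerically
p_a = np.array([0.3, 0.7])                       # p(a)
p_x_given_a = np.array([[0.2, 0.5, 0.3],         # p(x|a=0)
                        [0.6, 0.1, 0.3]])        # p(x|a=1)
x_values = np.array([0.0, 1.0, 2.0])

e_x_given_a = p_x_given_a @ x_values             # E[X|A=a] = sum_x x p(x|a)
lhs = p_a @ e_x_given_a                          # sum_a p(a) E[X|A=a]
rhs = (p_a @ p_x_given_a) @ x_values             # E[X] via p(x) = sum_a p(x|a) p(a)
print(lhs, rhs, np.isclose(lhs, rhs))            # the two values coincide
```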

2.4 Solving the Bellman Equation

From the analysis above, we know that the Bellman equation is

$$
\textcolor{blue}{v_\pi(s)=\sum_a \pi(a|s) \Big[\sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) \Big]}
$$

We can rewrite it in the following form

$$
v_\pi(s) = r_\pi(s) + \gamma \sum_{s^\prime} p_\pi(s^\prime|s) v_\pi(s^\prime)
$$

where

$$
r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r, \qquad
p_\pi(s^\prime|s) \triangleq \sum_a \pi(a|s) p(s^\prime|s,a)
$$

Suppose the states are indexed as $s_1,\dots,s_n$. For state $s_i$, the Bellman equation is

$$
v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j} p_\pi(s_j|s_i) v_\pi(s_j)
$$

Let $v_\pi=[v_\pi(s_1),\dots,v_\pi(s_n)]^T \in \mathbb{R}^n$, $r_\pi=[r_\pi(s_1),\dots,r_\pi(s_n)]^T \in \mathbb{R}^n$, and $P_\pi \in \mathbb{R}^{n\times n}$, where $[P_\pi]_{ij} = p_\pi(s_j|s_i)$. Hence we have the matrix form as

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi
$$

For example, with four states $s_1,\dots,s_4$, the matrix-vector form is

$$
\underbrace{
\begin{bmatrix}
v_\pi(s_1) \newline
v_\pi(s_2) \newline
v_\pi(s_3) \newline
v_\pi(s_4)
\end{bmatrix}
}_{v_\pi} =
\underbrace{
\begin{bmatrix}
r_\pi(s_1) \newline
r_\pi(s_2) \newline
r_\pi(s_3) \newline
r_\pi(s_4)
\end{bmatrix}
}_{r_\pi} + \gamma
\underbrace{
\begin{bmatrix}
p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \newline
p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \newline
p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \newline
p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4)
\end{bmatrix}
}_{P_\pi}
\underbrace{
\begin{bmatrix}
v_\pi(s_1) \newline
v_\pi(s_2) \newline
v_\pi(s_3) \newline
v_\pi(s_4)
\end{bmatrix}
}_{v_\pi}
$$

Conclusions:

  • So we have the closed-form solution $v_\pi = (I-\gamma P_\pi)^{-1}r_\pi$. However, computing the matrix inverse is expensive when the number of states is large.
  • Hence we turn to iterative solution methods.

Iterative solution

We can solve the Bellman equation using the following iterative algorithm:

$$
\textcolor{blue}{v_{k+1} = r_\pi + \gamma P_\pi v_k}
$$

The proof of convergence is omitted here; a code sketch of the iteration is given below.
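Below is a minimal sketch of this iteration applied to the earlier four-state cycle example (the reward values are the same arbitrary ones assumed before); it also compares the fixed point with the closed-form solution.

```python
import numpy as np

# Iterative policy evaluation v_{k+1} = r_pi + gamma * P_pi v_k for the
# four-state cycle example (arbitrary reward values, as in the earlier sketch).
gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, -1.0])
P_pi = np.array([[0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [1, 0, 0, 0]], dtype=float)

def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration starting from v_0 = 0."""
    v = np.zeros(len(r_pi))
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:     # stop when the update is tiny
            return v_new
        v = v_new
    return v

v_iter = iterative_policy_evaluation(r_pi, P_pi, gamma)
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(np.allclose(v_iter, v_closed))            # True: both solve the Bellman equation
```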

2.5 Action value

Action value is denoted by $q_\pi(s,a)$, which is defined as

$$
q_\pi(s,a) \triangleq \mathbb{E}[G_t|S_t=s, A_t=a]
$$

for all $s \in \mathcal{S}, a\in \mathcal{A}(s)$.

Action value can be interpreted as the expected return along trajectories generated by policy $\pi$ after taking a specific action $a$ in state $s$.

From the law of total expectation, we have

$$
\underbrace{\mathbb{E}[G_t|S_t=s]}_{v_\pi(s)} = \sum_a \pi(a|s) \underbrace{\mathbb{E}[G_t|S_t=s,A_t=a]}_{q_\pi(s,a)}
$$

Hence,

$$
v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a) \tag{1}
$$

Comparing this with the Bellman equation for the state value, we can express the action value as

$$
\textcolor{blue}{q_\pi(s,a)=\sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) v_\pi(s^\prime) } \tag{2}
$$

Substituting $(1)$ into $(2)$ we have

$$
q_\pi(s,a)=\sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) \sum_{a^\prime \in \mathcal{A}(s^\prime)}\pi(a^\prime|s^\prime) q_\pi(s^\prime,a^\prime)
$$
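As a sketch of how the action values relate to the state values under a known model, the code below uses random placeholder arrays for $p(r|s,a)$, $p(s^\prime|s,a)$, and $\pi(a|s)$ (all hypothetical), solves for $v_\pi$, evaluates $q_\pi(s,a)$ via $(2)$, and checks $(1)$.

```python
import numpy as np

# Placeholder model (hypothetical numbers): p_r[s,a,k] = p(r_k|s,a),
# p_s[s,a,s'] = p(s'|s,a), pi[s,a] = pi(a|s)
n_states, n_actions = 3, 2
gamma = 0.9
rng = np.random.default_rng(0)

rewards = np.array([0.0, 1.0])
p_r = rng.dirichlet(np.ones(len(rewards)), size=(n_states, n_actions))
p_s = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 0.5)

# Solve the state-value Bellman equation: v_pi = (I - gamma P_pi)^{-1} r_pi
r_pi = np.einsum('sa,sak,k->s', pi, p_r, rewards)
P_pi = np.einsum('sa,sax->sx', pi, p_s)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# q_pi(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s')
q_pi = p_r @ rewards + gamma * p_s @ v_pi

# Consistency check: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
print(np.allclose(v_pi, np.sum(pi * q_pi, axis=1)))    # True
```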

Suppose each state has the same number of actions. The matrix-vector form of the Bellman equation in terms of action values is

$$
\textcolor{red}{q_\pi = \tilde{r} + \gamma P \Pi q_\pi} \quad \text{(matrix-vector form)}
$$

where $q_\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is the action value vector indexed by state-action pairs. In particular, the $(s,a)$th element is

$$
[q_\pi]_{(s,a)} = q_\pi(s,a)
$$

Here, $\tilde{r}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is the immediate reward vector indexed by state-action pairs. In particular, the $(s,a)$th element is

$$
[\tilde{r}]_{(s,a)} = \sum_r p(r|s,a)r
$$

Here, $P\in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}|}$ is the probability transition matrix, whose rows are indexed by state-action pairs and columns by states. In particular,

$$
[P]_{(s,a),s^\prime} = p(s^\prime|s,a)
$$

And $\Pi \in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}||\mathcal{A}|}$ describes the policy $\pi$. In particular

$$
\Pi_{s^\prime, (s^\prime,a^\prime)} = \pi(a^\prime|s^\prime)
$$

and the other entries of $\Pi$ are zero. $\Pi$ is a block-diagonal matrix in which each block is a $1\times|\mathcal{A}|$ row vector.
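To illustrate this matrix-vector form, the sketch below builds $\tilde{r}$, $P$, and the block-diagonal $\Pi$ for the same placeholder model as in the previous sketch (again hypothetical) and solves $q_\pi = \tilde{r} + \gamma P \Pi q_\pi$ directly.

```python
import numpy as np

# Same placeholder model as in the previous sketch (hypothetical numbers).
n_states, n_actions = 3, 2
gamma = 0.9
rng = np.random.default_rng(0)

rewards = np.array([0.0, 1.0])
p_r = rng.dirichlet(np.ones(len(rewards)), size=(n_states, n_actions))
p_s = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 0.5)

n_sa = n_states * n_actions                    # number of (s,a) pairs
r_tilde = (p_r @ rewards).reshape(n_sa)        # [r~]_(s,a) = sum_r p(r|s,a) r
P = p_s.reshape(n_sa, n_states)                # [P]_{(s,a),s'} = p(s'|s,a)

# Pi[s', (s',a')] = pi(a'|s'); zero elsewhere (block diagonal, 1 x |A| blocks)
Pi = np.zeros((n_states, n_sa))
for s in range(n_states):
    Pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]

# Solve q_pi = r~ + gamma * P Pi q_pi  <=>  (I - gamma P Pi) q_pi = r~
q_pi = np.linalg.solve(np.eye(n_sa) - gamma * P @ Pi, r_tilde)
print(q_pi.reshape(n_states, n_actions))       # q_pi(s,a), one row per state
```

With the same random seed, the result should coincide with the $q_\pi$ obtained through $v_\pi$ in the previous sketch.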


Reference

Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning.