Reinforcement Learning with Code.

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundations of Reinforcement Learning.

Chapter 1. Basic Concepts

1.1 State and action

  • State describes the status of the agent with respect to the environment, denoted by $s$.
  • State space is the set of all states, denoted by $\mathcal{S}=\{s_1, s_2,\dots,s_n\}$.
  • Action describes what the agent can do with respect to the environment, denoted by $a$.
  • Action space is the set of all actions, denoted by $\mathcal{A}=\{a_1, a_2,\dots,a_n\}$.
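To connect these definitions with code, here is a minimal sketch of a state space and an action space, assuming a small grid world purely for illustration (the grid size and action names are not from the text above):

```python
# A minimal sketch of a state space and an action space,
# assuming a 3x3 grid world purely for illustration.
n_rows, n_cols = 3, 3

# State space S = {s_1, ..., s_9}, here indexed 0..8.
states = list(range(n_rows * n_cols))

# Action space A = {up, down, left, right, stay}.
actions = ["up", "down", "left", "right", "stay"]

print(f"|S| = {len(states)}, |A| = {len(actions)}")
```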

1.2 State transition

When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition can be denoted by
$$
s_1 \stackrel{a_2}\longrightarrow s_2
$$
In general, state transition is described by the conditional probability $p(s^\prime|s,a)$.

State transition can be either deterministic or stochastic. For example, a deterministic state transition is
$$
p(s_1|s_1,a_2) = 0 \\
p(s_2|s_1,a_2) = 1 \\
p(s_3|s_1,a_2) = 0
$$
An example of a stochastic state transition is
$$
p(s_1|s_1,a_2) = 0.5 \\
p(s_2|s_1,a_2) = 0.3 \\
p(s_3|s_1,a_2) = 0.2
$$
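In code, the distribution $p(s^\prime|s,a)$ can be stored as a lookup table and sampled from. The sketch below is a minimal illustration using the stochastic example above; the dictionary layout and the string state/action names are assumptions for readability:

```python
import random

# p(s'|s,a) for the stochastic example above, stored as
# {(s, a): {s': probability}}.
P = {
    ("s1", "a2"): {"s1": 0.5, "s2": 0.3, "s3": 0.2},
}

def sample_next_state(s, a):
    """Draw s' ~ p(.|s, a)."""
    dist = P[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

print(sample_next_state("s1", "a2"))  # e.g. 's2'
```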

1.3 Policy

  • Policy tells the agent which action to take at each state, denoted by $\pi$.
  • Policy is described by a conditional probability.
  • Policy can be deterministic or stochastic: a deterministic policy assigns exactly one action to each state, whereas a stochastic policy assigns a probability distribution over actions to each state.

Suppose the action space is $\mathcal{A}=\{a_1, a_2,a_3\}$. A deterministic policy can be denoted by
$$
\pi(a_1|s_1) = 0 \\
\pi(a_2|s_1) = 1 \\
\pi(a_3|s_1) = 0
$$
which indicates that the probability of taking action $a_2$ is $1$ and the others are zero.

A stochastic policy can be denoted by
$$
\pi(a_1|s_1) = 0.5 \\
\pi(a_2|s_1) = 0.3 \\
\pi(a_3|s_1) = 0.2
$$
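A policy $\pi(a|s)$ is likewise just a conditional distribution over actions. Here is a minimal sketch of storing and sampling the stochastic policy above (the table layout is an implementation choice, not prescribed by the text):

```python
import random

# pi(a|s) for the stochastic policy above: {s: {a: probability}}.
pi = {
    "s1": {"a1": 0.5, "a2": 0.3, "a3": 0.2},
}

def sample_action(s):
    """Draw a ~ pi(.|s)."""
    dist = pi[s]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

# A deterministic policy is the special case where one action has probability 1.
print(sample_action("s1"))
```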

1.4 Reward

  • Reward is one of the most unique concepts in RL.
  • An immediate reward is obtained after taking an action.
  • Reward transition is the process of receiving a reward after taking an action; it can be deterministic or stochastic and is described by $p(r|s,a)$.

For example, a deterministic reward transition can be denoted by
$$
p(r=-1|s_1,a_2) = 1, p(r\ne -1|s_1,a_2)=0
$$
which means that, at state $s_1$, taking action $a_2$ yields the immediate reward $-1$ with probability $1$.

A stochastic reward transition can be denoted by
$$
p(r=1|s_1,a_2) = 0.5, p(r= 0|s_1,a_2)=0.5
$$
which means that, at state $s_1$, taking action $a_2$ yields the immediate reward $1$ with probability $0.5$ and the immediate reward $0$ with probability $0.5$.
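The reward distribution $p(r|s,a)$ can be handled the same way as the state transition. A minimal sketch using the stochastic reward example above (again, the dictionary representation is only an assumption for illustration):

```python
import random

# p(r|s,a) for the stochastic example above: {(s, a): {r: probability}}.
R = {
    ("s1", "a2"): {1: 0.5, 0: 0.5},
}

def sample_reward(s, a):
    """Draw r ~ p(.|s, a)."""
    dist = R[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

print(sample_reward("s1", "a2"))  # 1 or 0, each with probability 0.5
```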

1.5 Trajectory, return, episode

  • Trajectory is a state-action-reward chain, such as $s_1 \underset{r=0}{\xrightarrow{a_2}} s_2 \underset{r=0}{\xrightarrow{a_3}} s_3 \underset{r=0}{\xrightarrow{a_4}} \cdots\underset{r=1}{\xrightarrow{a_n}} s_{n}$.

  • Return of this trajectory is the sum of all the rewards collected along the trajectory, such as $\text{return} = 0+0+0+\cdots+1=1$. Return is also called total reward or cumulative reward.

  • Discounted return is defined with a discount rate, denoted by $\gamma\in(0,1)$ (see the code sketch after this list). The discounted return of the trajectory above is
    $$
    \text{discounted return} = 0+\gamma \cdot 0+\gamma^2 \cdot 0 + \gamma^3 \cdot 0 + \cdots + \gamma^n \cdot 1
    $$

  • Episode refers to the trajectory obtained by interacting with the environment following a policy until the agent reaches a terminal state. An episode is usually assumed to be a finite trajectory, and tasks with episodes are called episodic tasks. Some tasks have no terminal state; such tasks are called continuing tasks.
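To make return and discounted return concrete, the sketch below computes both for a short reward sequence; the 5-step trajectory and $\gamma=0.9$ are assumptions chosen only for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite trajectory's reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards collected along the example trajectory: 0, 0, 0, ..., 1.
rewards = [0, 0, 0, 0, 1]

print(discounted_return(rewards, gamma=1.0))  # plain return: 1
print(discounted_return(rewards, gamma=0.9))  # discounted return: 0.9**4 = 0.6561
```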

1.6 Markov decision process (MDP)

A Markov decision process is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below:

  • Sets:

    • State set: the set of all states, denoted as $\mathcal{S}$.
    • Action set: a set of actions, denoted as $\mathcal{A}(s)$, is associated with each state $s\in\mathcal{S}$.
    • Reward set: a set of rewards, denoted as $\mathcal{R}(s,a)$, is associated with each state-action pair $(s,a)$.
  • Model:

    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s^\prime$ is $p(s^\prime|s,a)$.
    • Reward transition probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r|s,a)$.
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$.

  • Markov property: one key property of MDPs is the Markov property, which refers to the memoryless property of a stochastic process:
    $$
    p(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\dots,s_0,a_0)=p(s_{t+1}|s_t,a_t) \\
    p(r_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\dots,s_0,a_0)=p(r_{t+1}|s_t,a_t)
    $$
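Putting the ingredients together, the sketch below rolls out a trajectory from a tiny hypothetical two-state MDP. Each step samples $a\sim\pi(\cdot|s)$, $s^\prime\sim p(\cdot|s,a)$, and $r\sim p(\cdot|s,a)$ using only the current state, which is exactly the Markov property; all the specific numbers are made up for illustration:

```python
import random

def sample(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

# A tiny hypothetical MDP (two states, two actions), purely for illustration.
P  = {("s1", "a1"): {"s1": 1.0},             # state transition p(s'|s,a)
      ("s1", "a2"): {"s2": 1.0},
      ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
      ("s2", "a2"): {"s2": 1.0}}
R  = {("s1", "a1"): {0: 1.0},                # reward transition p(r|s,a)
      ("s1", "a2"): {0: 1.0},
      ("s2", "a1"): {1: 0.5, 0: 0.5},
      ("s2", "a2"): {1: 1.0}}
pi = {"s1": {"a1": 0.5, "a2": 0.5},          # policy pi(a|s)
      "s2": {"a1": 0.2, "a2": 0.8}}

def rollout(s, T):
    """Generate a length-T trajectory. Each step uses only the current
    state (and the sampled action) -- the Markov property."""
    trajectory = []
    for _ in range(T):
        a = sample(pi[s])
        s_next = sample(P[(s, a)])
        r = sample(R[(s, a)])
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

print(rollout("s1", T=5))
```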


Reference

Professor Zhao Shiyu's course, Mathematical Foundations of Reinforcement Learning.