Exercise Solutions
Introduction
See the source code in the GitHub repo, and if you have any questions, feel free to contact me at brycechen1849@gmail.com. This page serves mainly as a public set of notes for the book, and it is still being updated rapidly as I get familiar with the RL research area.
References
The references for the code implementations are:
- Solutions to the exercises (somewhat outdated, since the latest edition of the book adds many new exercises): Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions
- Code for each figure in the book: reinforcement-learning-an-introduction
For plotting the figures, usage and examples can be found in the Matplotlib Gallery.
Solutions
Chapter 4
Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11,down)$? What is $q_{\pi}(7,down)$?
Ans:
\begin{equation} q_{\pi}(s,a) = \sum_{s',r}{p(s',r \mid s,a)\left(r + \gamma v_{\pi}(s')\right)} \end{equation}
$q_{\pi}(7,down) = r + v_{\pi}(11) = -1 - 14 = -15$,
$q_{\pi}(11,down) = r + v_{\pi}(\text{terminal}) = -1 + 0 = -1$.
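As a quick numeric sanity check, here is a minimal sketch (the $v_{\pi}$ values are the converged equiprobable-policy values from Figure 4.1 of the book, typed in by hand and covering only the states used here):

```python
# Check q_pi(s, a) = sum_{s', r} p(s', r | s, a) (r + v_pi(s'))
# for the deterministic 4x4 gridworld of Example 4.1 (gamma = 1).
# "T" denotes the terminal state.
v_pi = {11: -14.0, "T": 0.0}

def q(reward, next_state):
    """One-step lookup: deterministic transition, undiscounted."""
    return reward + v_pi[next_state]

print("q(7, down)  =", q(-1, 11))   # -> -15.0
print("q(11, down) =", q(-1, "T"))  # -> -1.0
```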
Exercise 4.2
Ans: If the transitions out of state 13 are unchanged, state 15 is not reachable from any of the original states (the agent can only be in it if it starts there).
So state 15 is not a successor of any original state, and adding it does not affect the values of the original states. Its own value satisfies
\begin{equation} v_{\pi}(15) = -1 + \frac{1}{4}\left( v_{\pi}(12) + v_{\pi}(13) + v_{\pi}(14) + v_{\pi}(15) \right) = -1 + \frac{1}{4}\left( -22 - 20 - 14 + v_{\pi}(15) \right) \end{equation}
which gives $v_{\pi}(15) = -20$.
If the down action from state 13 instead leads to state 15, the new state becomes intertwined with the rest of the grid. However, the original values still satisfy the Bellman equation for state 13, since $-1 + \frac{1}{4}\left( v_{\pi}(9) + v_{\pi}(12) + v_{\pi}(14) + v_{\pi}(15) \right) = -1 + \frac{1}{4}(-20 - 22 - 14 - 20) = -20 = v_{\pi}(13)$, so the value function is unchanged and $v_{\pi}(15) = -20$ in this case as well.
Programming simulation
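A minimal sketch of such a simulation (iterative policy evaluation, assuming the layout and transitions of Example 4.1 and Exercise 4.2; this is not the repository's code):

```python
import numpy as np

# Iterative policy evaluation for the gridworld of Example 4.1, extended
# with state 15 as in Exercise 4.2 (equiprobable random policy,
# reward -1 per step, gamma = 1, terminal state denoted by 0).
GRID = {  # state -> (row, col) for the original 4x4 grid
    1: (0, 1), 2: (0, 2), 3: (0, 3),
    4: (1, 0), 5: (1, 1), 6: (1, 2), 7: (1, 3),
    8: (2, 0), 9: (2, 1), 10: (2, 2), 11: (2, 3),
    12: (3, 0), 13: (3, 1), 14: (3, 2),
}
POS = {pos: s for s, pos in GRID.items()}
POS[(0, 0)] = 0  # terminal corners
POS[(3, 3)] = 0
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, state13_changed=False):
    """Deterministic successor; off-grid moves leave the state unchanged."""
    if state == 15:  # new state below state 13
        return {"left": 12, "up": 13, "right": 14, "down": 15}[action]
    if state == 13 and action == "down" and state13_changed:
        return 15
    r, c = GRID[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    return POS[(nr, nc)] if 0 <= nr <= 3 and 0 <= nc <= 3 else state

def evaluate(state13_changed=False, tol=1e-6):
    v = {s: 0.0 for s in list(GRID) + [0, 15]}
    while True:
        delta = 0.0
        for s in v:
            if s == 0:          # terminal state stays at 0
                continue
            new = np.mean([-1 + v[step(s, a, state13_changed)] for a in MOVES])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

print(round(evaluate(False)[15], 1))  # -> -20.0
print(round(evaluate(True)[15], 1))   # -> -20.0
```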
Exercise 4.3
Ans:
\begin{equation} q_{\pi}(s,a) = E_{\pi} \left[ R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \end{equation}
\begin{equation} q_{\pi}(s,a) \doteq \sum_{s',r}{ \left[ p(s',r \mid s,a) \left[ r + \gamma \sum_{a'}{\pi(a' \mid s') q_{\pi}(s',a')} \right] \right] } \end{equation}
\begin{equation} q_{k+1}(s,a) \doteq \sum_{s',r}{ \left[ p(s',r \mid s,a) \left[ r + \gamma \sum_{a'}{\pi(a' \mid s') q_{k}(s',a')} \right] \right] } \end{equation}
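For reference, a minimal sketch of one sweep of this update in generic tabular form (the `dynamics` and `policy` data structures are assumptions for illustration, not part of the book's pseudocode):

```python
# One sweep of the action-value analogue of iterative policy evaluation:
#   q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * sum_{a'} pi(a'|s') q_k(s',a') ].
# `dynamics[(s, a)]` is an assumed table of (prob, next_state, reward) triples,
# and `policy[s]` maps each action to its probability under pi.

def q_sweep(q, dynamics, policy, gamma=1.0):
    new_q = {}
    for (s, a), outcomes in dynamics.items():
        total = 0.0
        for prob, s_next, reward in outcomes:
            if s_next in policy:   # non-terminal successor
                expected_next = sum(policy[s_next][a2] * q[(s_next, a2)]
                                    for a2 in policy[s_next])
            else:                  # terminal state: action values are 0
                expected_next = 0.0
            total += prob * (reward + gamma * expected_next)
        new_q[(s, a)] = total
    return new_q
```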
Exercise 4.4
Ans: In the policy-improvement step, set policy-stable to false (and change $\pi(s)$) only when the old action is not in the set of actions that maximize the expected one-step return; if the old action already attains the maximum, keep it. This prevents endless switching between equally good policies, as sketched below.
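A minimal sketch of the modified check (the helper `action_values(s, V)` is an assumption here, returning the one-step expected return of every action in state `s`):

```python
import numpy as np

# Modified policy improvement for Exercise 4.4: the policy is only marked
# unstable when the old action is *not* among the maximizing actions, so
# ties between equally good actions can no longer cause endless switching.

def improve_policy(states, policy, V, action_values, tol=1e-9):
    policy_stable = True
    for s in states:
        old_action = policy[s]
        q = action_values(s, V)                         # assumed helper
        best_actions = np.flatnonzero(q >= q.max() - tol)  # all tied maximizers
        if old_action not in best_actions:
            policy[s] = int(best_actions[0])            # any maximizer will do
            policy_stable = False
    return policy, policy_stable
```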
Exercise 4.5
Ans: Everything remains the same except for two steps.
In step 2 (policy evaluation) we have the self-consistency equation of $q_{\pi}$ under policy $\pi$:
\begin{equation}
q_{\pi}(s,a) \gets \sum_{s',r}{\left[p(s',r \mid s,a) \left(r + \gamma \sum_{a'}{\pi(a' \mid s') q_{\pi}(s',a')}\right)\right]}
\end{equation}
In step 3 (policy improvement) we update $\pi(s)$ to:
\begin{equation} \pi(s) \gets \arg\max_{a} q_{\pi}(s,a) \end{equation}
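A minimal sketch of policy iteration carried out on action values, in generic tabular form (the `dynamics` table of `(prob, next_state, reward)` triples is an assumption for illustration; a discount below 1 is used so that evaluation converges without special terminal handling):

```python
# Policy iteration on q(s, a) (Exercise 4.5). Terminal states simply never
# appear in `states`; their action values are treated as 0.

def policy_iteration_q(states, actions, dynamics, gamma=0.9, theta=1e-8):
    policy = {s: actions[0] for s in states}            # arbitrary initial policy
    q = {(s, a): 0.0 for s in states for a in actions}

    def backup(s, a):
        # q(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * q(s', pi(s')) ]
        return sum(p * (r + gamma * q.get((s2, policy.get(s2)), 0.0))
                   for p, s2, r in dynamics[(s, a)])

    while True:
        # 2. policy evaluation (on q)
        while True:
            delta = 0.0
            for s in states:
                for a in actions:
                    new = backup(s, a)
                    delta = max(delta, abs(new - q[(s, a)]))
                    q[(s, a)] = new
            if delta < theta:
                break
        # 3. policy improvement (switch only on strict improvement; see Ex. 4.4)
        policy_stable = True
        for s in states:
            best = max(actions, key=lambda a: q[(s, a)])
            if q[(s, best)] > q[(s, policy[s])] + theta:
                policy[s] = best
                policy_stable = False
        if policy_stable:
            return policy, q
```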
Exercise 4.6
Ans:
In step 2, the term
\begin{equation} V_{\pi}(s) = \sum_{s',r}{p(s',r|s,\pi(s))(r + \gamma V_{\pi}(s'))} \end{equation}
is replaced by
\begin{equation} V_{\pi'}(s) = \sum_{a}{\left[\pi'(a \mid s) \sum_{s',r}{p(s',r|s,a)(r + \gamma V_{\pi'}(s'))}\right]} \end{equation}
where $\pi'$ is an $\epsilon$-soft policy, i.e. $\pi'(a \mid s) \ge \frac{\epsilon}{|\mathcal{A}(s)|}$ for every state $s$ and action $a$.
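A minimal sketch of one evaluation sweep under a stochastic (e.g. $\epsilon$-soft) policy, using the same assumed `dynamics` and `policy` tables as in the earlier sketches:

```python
# One sweep of iterative policy evaluation for a stochastic policy:
#   V(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [ r + gamma * V(s') ].
# `policy[s]` maps each action to its probability (>= epsilon / |A(s)| for an
# epsilon-soft policy); `dynamics[(s, a)]` lists (prob, next_state, reward).

def v_sweep_soft(V, policy, dynamics, gamma=1.0):
    new_V = dict(V)
    for s, action_probs in policy.items():
        total = 0.0
        for a, pi_a in action_probs.items():
            for p, s_next, r in dynamics[(s, a)]:
                total += pi_a * p * (r + gamma * V.get(s_next, 0.0))
        new_V[s] = total
    return new_V
```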
Exercise 4.7
Ans: Programming simulation
Exercise 4.8
Ans:
Exercise 4.9
Ans: Programming simulation
Exercise 4.10
Ans: For each state-action pair $(s,a)$:
\begin{equation} Q_{k+1}(s,a) = \sum_{s',r}{ p(s',r \mid s,a) \left( r + \gamma \max_{a'} Q_{k}(s',a') \right) } \end{equation}
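A minimal sketch of this update in generic tabular form (same assumed `dynamics` table as in the earlier sketches):

```python
# One sweep of value iteration on action values (Exercise 4.10):
#   Q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * max_{a'} Q_k(s',a') ].
# `dynamics[(s, a)]` lists (prob, next_state, reward); terminal successors
# have no entries in Q and contribute 0.

def q_value_iteration_sweep(Q, dynamics, gamma=1.0):
    new_Q = {}
    for (s, a), outcomes in dynamics.items():
        total = 0.0
        for p, s_next, r in outcomes:
            next_values = [Q[(s2, a2)] for (s2, a2) in Q if s2 == s_next]
            best_next = max(next_values) if next_values else 0.0
            total += p * (r + gamma * best_next)
        new_Q[(s, a)] = total
    return new_Q
```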
Exercise 4.11
Ans: Programming simulation