Dynamic Programming

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, Ch. 4: Dynamic Programming). Cambridge, MA: MIT Press.

Introduction

Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). DP is limited in its utility by its assumptions and computational requirements, yet it is theoretically important to understand. DP uses value functions to organize the search for optimal policies. As we shall see, DP algorithms are obtained by turning Bellman equations into assignment statements, that is, into update rules for improving approximations of the desired value functions, $ v_* $ or $q_*$.

Bellman Optimality Equations:

$ v_*(s) = \displaystyle\max_a \mathbb{E} \left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right] $



$ = \displaystyle\max_a \sum_{s',r} p(s',r \mid s,a) \left[ r + \gamma v_*(s') \right] $ , or



$ q_*(s,a) =\mathbb{E} \left[R_{t+1} + \gamma \displaystyle\max_{a'}q_*(S_{t+1},a') \mid S_t = s, A_t = a\right] $



$ = \sum_{s',r} p(s',r \mid s,a) \left[ r+ \gamma \displaystyle\max_{a'}q_*(s',a') \right] $



for all $ s \in \mathcal{S} $, $ a \in \mathcal{A}(s) $, and $ s' \in \mathcal{S}^+ $
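
To make "turning a Bellman equation into an assignment" concrete, here is a minimal sketch of the $v_*$ backup written as an update rule (this is the backup that value iteration, discussed later, applies repeatedly). The names p, V, and actions_for are hypothetical containers introduced only for this illustration; the grid world code below does not use them.

# Minimal sketch, assuming a known model:
#   p[(s, a)] is a list of (prob, s_prime, reward) tuples
#   actions_for[s] is the set of actions available in state s
#   V is a dict of current value estimates
def bellman_optimality_backup(s, V, p, actions_for, gamma=1.0):
  # v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
  return max(
    sum(prob * (reward + gamma * V[s_prime])
        for prob, s_prime, reward in p[(s, a)])
    for a in actions_for[s]
  )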

Policy Evaluation

Policy evaluation entails computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. For all $ s \in \mathcal{S} $,

$ v_\pi(s) = \mathbb{E}_\pi \left[G_t \mid S_t = s\right] $



$ = \mathbb{E}_\pi \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] $



$ = \mathbb{E}_\pi \left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right] $



$ = \displaystyle\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_\pi(s') \right]$

Here, $ \pi(a \mid s) $ is the probability of taking action $a$ in state $s$ under policy $\pi$. Assuming the environment's dynamics are completely known,

$ v_\pi(s) = \displaystyle\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_\pi(s') \right]$

is a system of $ \mid \mathcal{S} \mid $ simultaneous linear equations in $ \mid \mathcal{S} \mid $ unknowns (the $v_\pi(s),s \in \mathcal{S}$).
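
Before turning to iterative methods, note that for small problems this linear system can also be solved directly. Below is a minimal sketch, assuming the policy and dynamics have been collapsed into a hypothetical transition matrix P_pi and expected-reward vector r_pi (names introduced only for this illustration, not used elsewhere in this post).

import numpy as np

# Sketch: v_pi = r_pi + gamma * P_pi v_pi  =>  (I - gamma * P_pi) v_pi = r_pi
def solve_v_pi(P_pi, r_pi, gamma=0.9):
  # P_pi: |S| x |S| matrix, P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
  # r_pi: length-|S| vector of expected immediate rewards under pi
  I = np.eye(len(r_pi))
  return np.linalg.solve(I - gamma * P_pi, r_pi)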

Iterative Policy Evaluation

In principle, the solution to this system of linear equations is a straightforward, if tedious, computation; for our purposes, iterative methods are more suitable. Starting from an arbitrary initial approximation $v_0$ and forming a sequence of successive approximate value functions $v_0$, $v_1$, $v_2$, ..., $v_k$, each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule. For all $s \in \mathcal{S}$:

$ v_{k+1}(s) = \mathbb{E}_\pi \left[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s \right] $



$ = \displaystyle\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_k(s') \right]$

The sequence $\{v_k\}$ can be shown, in general, to converge to $v_\pi$ as $k \rightarrow \infty$.
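
Written generically, one sweep-based sketch of this update might look as follows; the grid world code in the next section specializes it. The arguments states, actions_for, pi, and p are hypothetical containers mirroring the notation above, not variable names used later in this post.

# Minimal sketch of iterative policy evaluation (in-place variant), assuming:
#   states: iterable of states
#   actions_for[s]: feasible actions in state s
#   pi[(s, a)]: probability of action a in state s under the policy
#   p[(s, a)]: list of (prob, s_prime, reward) transition tuples
def iterative_policy_evaluation(states, actions_for, pi, p, gamma=1.0, theta=1e-4):
  V = {s: 0.0 for s in states}
  while True:
    delta = 0.0
    for s in states:
      v_new = sum(
        pi[(s, a)] * prob * (reward + gamma * V[s_prime])
        for a in actions_for[s]
        for prob, s_prime, reward in p[(s, a)]
      )
      delta = max(delta, abs(v_new - V[s]))
      V[s] = v_new
    if delta < theta:
      return V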

Classic Grid World Example

Consider a square $n \times n$ grid world consisting of $n^2$ states.

The nonterminal states are $\mathcal{S} = \{ 2,3,\ldots,15 \}$, and the two corner cells are terminal. From each state there are four possible actions, $\mathcal{A} = \{up, down, right, left\} $, each moving the agent to the corresponding neighboring state.
Actions that would take the agent off the grid leave the state unchanged (in the implementation below such actions are simply excluded from the action set); e.g., if $s = 3$ is on the top row, the action $up$ is not considered.
The reward is $R_t = -1 $ on all transitions until a terminal state is reached, i.e., $r(s,a,s') = -1$ for all nonterminal states $s, s'$ and actions $a$. The agent follows an equiprobable random policy (all actions equally likely).

Iterative Policy Evaluation in Grid World Code


import numpy as np

n = 4 # Grid is n x n, so there are n^2 states
R = np.zeros((n,n)) - 1 # Initialize reward dynamics (R = -1 everywhere)
R[0,0] = 0 # Terminal state (top-left corner), R = 0
R[-1,-1] = 0 # Terminal state (bottom-right corner), R = 0
a = np.arange(n) # Row/column indices 0..n-1
states = np.array(np.meshgrid(a,a)).T.reshape(-1, 2) # All (i,j) state coordinates
V = np.zeros(np.shape(R)) # Initialize k = 0 value estimates, v_0 = 0
actions = np.array([[1,0],  # Down (row index + 1)
                    [-1,0], # Up (row index - 1)
                    [0,1],  # Right
                    [0,-1]]) # Left
gamma = 1 # Temporal discounting, gamma = 1 (no discounting)
delta = float('inf') # Algorithm iterates until convergence (delta < theta)
theta = 1e-4 # Small convergence threshold
terminal_states = {(0, 0), (n - 1, n - 1)} # Terminal states are never updated
while delta > theta: # Iterate until convergence
  delta = 0
  oldV = V.copy() # Store the previous (k-1) value estimates
  for i,j in states: # Sweep over all states
    if (i,j) in terminal_states: # Skip terminal states
      continue
    sPrime = [i,j] + actions # Possible next states
    idx = np.all((sPrime>=0) & (sPrime<=n-1), 1) # Only keep states that exist on the grid
    sPrime = sPrime[idx,:]
    pi = 1./np.size(sPrime,0) # Equal probability over the feasible actions
    V[i,j] = sum(pi * (R[i,j] + gamma * oldV[a,b]) for a,b in sPrime) # Expected one-step backup
    delta = max(delta, abs(V[i,j] - oldV[i,j])) # Track the largest change for the convergence check
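
To inspect the converged estimates numerically, one can simply print the value array (a display step only, not part of the algorithm):

print(np.round(V, 1)) # Converged state-value estimates under the equiprobable random policy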

Here is the process of convergence; as we can see, $v_\pi$ converges to its final structure fairly quickly.

Code to extract policy

##### Compute Initial Policy #####
policy = np.zeros((n,n,2)) # Initialize empty policy (one action vector per state)
for i,j in states: # Loop through each state
  if (i,j) in terminal_states: # Skip terminal states
    continue
  sPrime = [i,j] + actions # All next states reachable from the current state
  idx = np.all((sPrime>=0) & (sPrime<=n-1), 1)
  sPrime = sPrime[idx,] # Only keep feasible next states
  possible_actions = actions[idx,] # ...and the corresponding feasible actions
  v = np.array([V[a,b] for a,b in sPrime]) # Value of each possible next state
  isOptimal = np.argmax(v) # Select the optimal action (ties broken as first occurrence)
  policy[i,j] = possible_actions[isOptimal,] # Store the optimal action for this state
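
For a quick sanity check of the extracted policy, the stored action vectors can be rendered as arrows; the arrow_for mapping below is a hypothetical helper introduced only for this illustration, using the same row/column convention as the actions array above.

arrow_for = {(1,0): 'v', (-1,0): '^', (0,1): '>', (0,-1): '<'} # Row/column moves -> arrow characters
for i in range(n):
  row = []
  for j in range(n):
    if (i,j) in terminal_states:
      row.append('T') # Mark terminal states
    else:
      row.append(arrow_for[tuple(int(x) for x in policy[i,j])])
  print(' '.join(row))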

Policy Improvement

How do we know that the policy we computed is the optimal policy? For an arbitrary deterministic policy $\pi$ and some state $s$, we would like to know whether we should choose an alternative action $a \neq \pi(s)$. One way to decide is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$; the value of behaving this way is:

$ q_\pi(s,a) = \mathbb{E} \left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] $



$ = \sum_{s',r} p(s',r \mid s,a) \left[ r + \gamma v_\pi(s') \right] $

We want to check whether we can improve the policy by applying the policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

$ q_\pi (s,\pi'(s)) \geq v_\pi(s)$

Then the policy $\pi'$ must be as good as, or better than, $\pi$; that is, it must obtain greater or equal expected return from all states $s \in \mathcal{S}$:

$ v_{\pi'}(s) \geq v_\pi(s)$

In particular, consider a changed policy $\pi'$ that is identical to the original deterministic policy $\pi$ except that $\pi'(s) = a \neq \pi(s)$. If $q_\pi(s,a) > v_\pi(s)$, then the changed policy is an improvement over $\pi$.
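
As a hedged illustration in the grid world from above, the improvement condition can be checked numerically by comparing $q_\pi(s,a)$ for each feasible action against $v_\pi(s)$, using the converged random-policy values V; has_improving_action is a helper introduced only for this sketch.

# Sketch: does any action a satisfy q_pi(s, a) > v_pi(s) under the random-policy values V?
def has_improving_action(i, j, V, R, actions, gamma=1.0, tol=1e-8):
  sPrime = [i,j] + actions
  idx = np.all((sPrime>=0) & (sPrime<=n-1), 1)
  q_values = [R[i,j] + gamma * V[a,b] for a,b in sPrime[idx,:]] # Transitions are deterministic here
  return max(q_values) > V[i,j] + tol

improvable = [tuple(s) for s in states
              if tuple(s) not in terminal_states and has_improving_action(*s, V, R, actions)]
print(improvable) # States where acting greedily improves on the random policy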

Greedy Estimate

Greedy methods entail selecting, at each state, the action that appears best according to the current value function $v_\pi$ (via a one-step lookahead):

$\pi'(s) = \DeclareMathOperator*{\argmax}{arg\max} \displaystyle\argmax_a q_\pi(s,a) $
$ = \displaystyle\argmax_a \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]$
$ = \displaystyle\argmax_a \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_\pi(s') \right] $

$\displaystyle\argmax_a$ denotes the value of $a$ at which the expression that follows is maximized (ties broken arbitrarily). This process of constructing a new policy that is greedy with respect to the value function of the original policy is called policy improvement. If the new greedy policy $\pi'$ is as good as, but not better than, the old policy $\pi$, then for all $s \in \mathcal{S}$,

$ v_{\pi'}(s) = \displaystyle\max_a \mathbb{E}\left[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a \right]$
$ = \displaystyle\max_a \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_{\pi'}(s') \right] $

which is just the Bellman optimality equation, so both policies must already be optimal. Policy improvement therefore yields a strictly better policy except when the original policy is already optimal.

Policy Improvement in Grid World Code

We have already computed an initial set of value estimates and used them to obtain an initial policy, $\pi$. Now we can update this policy with the greedy policy improvement method. Note that if $ q_\pi (s,\pi'(s)) = v_\pi(s)$ in every state, the policy is already stable and the loop below needs only one pass to confirm convergence.
To perform the update, we apply our Bellman equation to select the best action from each state.

##### Policy Improvement Update #####
policy_stable = False # Iterate until the policy is stable
its = 0
while not policy_stable:
  policy_stable = True # Assume stable until an action changes
  its += 1 # Count the number of iterations to converge
  oldV = V.copy() # Store old V
  for i,j in states:
    if (i,j) in terminal_states: # Skip terminal states
      continue
    sPrime = [i,j] + actions # Possible next states
    idx = np.all((sPrime>=0) & (sPrime<=n-1), 1) # Only keep states that exist on the grid
    sPrime = sPrime[idx,]
    possible_actions = actions[idx,] # Feasible actions
    vals = np.array([R[i,j] + gamma*oldV[a,b] for a,b in sPrime]) # Value of each neighboring state
    best = np.argmax(vals) # Choose the action leading to the highest value
    optimal = sPrime[best,] # Optimal next state
    best_action = possible_actions[best,] # Optimal action from the current state
    V[i,j] = R[i,j] + gamma * oldV[optimal[0],optimal[1]] # Update V based on the greedy action
    if not np.all(best_action == policy[i,j]): # If the old action differs from the new action
      policy_stable = False # The policy is not yet stable
      policy[i,j] = best_action # Update to the new action

As you can see, we had already converged to the optimal policy: all actions remain the same.

Value Iteration

One of the drawbacks of policy iteration is that each policy evaluation step itself requires multiple sweeps through the state set, and exact convergence to $v_\pi$ occurs only in the limit. Fortunately, this process can be truncated, for example in cases where policy evaluation iterations beyond the first few have no effect on the corresponding greedy policy. Value iteration truncates policy evaluation to a single sweep: it can be viewed as performing one sweep of policy evaluation under the current greedy policy, after which the greedy policy might change depending on the values produced by that sweep.

Value Iteration in Grid World Code

gamma = 1 # Temporal discounting, gamma = 1 (no discounting)
delta = float('inf') # Algorithm iterates until convergence (delta < theta)
theta = 1e-4 # Small convergence threshold
while delta > theta: # Iterate until convergence
  delta = 0
  oldV = V.copy() # Store the previous (k-1) value estimates
  for i,j in states: # Sweep over all states
    if (i,j) in terminal_states: # Skip terminal states
      continue
    sPrime = [i,j] + actions # Possible next states
    idx = np.all((sPrime>=0) & (sPrime<=n-1), 1) # Only keep states that exist on the grid
    sPrime = sPrime[idx,:]
    vals = np.array([R[i,j] + gamma*oldV[a,b] for a,b in sPrime]) # Value of each neighboring state
    best = np.argmax(vals) # Choose the action leading to the highest value
    optimal = sPrime[best,] # Greedy next state
    V[i,j] = R[i,j] + gamma*oldV[optimal[0],optimal[1]] # Value of the greedy choice (probability 1 on the greedy action)
    delta = max(delta, abs(V[i,j] - oldV[i,j])) # Track the largest change for the convergence check

Value iteration converges to its optimal value function fairly quickly because each sweep effectively updates the value estimates and the greedy policy simultaneously, rather than in separate sweeps. Here is the value function after $k=3$ iterations:

Asynchronous Dynamic Programming

A major drawback of the methods above is that they require full sweeps over the entire state set of the MDP. If the state set is very large, this can be computationally expensive. Asynchronous dynamic programming algorithms are not organized in terms of sweeps over the entire state set; instead, they update state values in any order, using whatever values of other states happen to be available. One example is to update the states the agent experiences in real time as it interacts with the MDP, so that only one state $s_k$ is updated on each time step using the value iteration backup; asymptotic convergence to $v_\ast$ then occurs only for the states that appear in the sequence $\{s_k\}$.
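
As a hedged sketch of this idea in the grid world, the value-iteration backup above can be applied in place to one state at a time instead of in full sweeps. The function below reuses the arrays from the earlier code (n, R, gamma, states, actions, terminal_states); the random state sequence simply stands in for whatever ordering the agent's experience might provide.

def async_value_iteration_step(V, i, j):
  # Apply the greedy (value iteration) backup to a single state, in place
  sPrime = [i,j] + actions # Possible next states
  idx = np.all((sPrime>=0) & (sPrime<=n-1), 1) # Only keep states on the grid
  sPrime = sPrime[idx,:]
  V[i,j] = max(R[i,j] + gamma * V[a,b] for a,b in sPrime) # Greedy one-step backup

V_async = np.zeros((n,n)) # Fresh value estimates
rng = np.random.default_rng(0)
for _ in range(2000): # Update one randomly chosen state per step
  i, j = states[rng.integers(len(states))]
  if (i,j) not in terminal_states:
    async_value_iteration_step(V_async, i, j)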

Efficiency

Dynamic programming is sometimes thought to suffer from the curse of dimensionality: the number of states often grows exponentially with the number of state variables. This is an inherent difficulty of the problem, however, not of DP as a method. On problems with very large state sets, asynchronous methods are often preferred, because synchronous methods require full sweeps across the entire state set.

Summary

In summary, dynamic programming refers to a set of algorithms for solving finite Markov decision processes. Policy evaluation and policy improvement can be combined to produce policy iteration and value iteration. The idea of letting these two interacting processes revolve around an approximate policy and an approximate value function is called generalized policy iteration (GPI).