Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, Ch. 4: Dynamic Programming). Cambridge, MA: MIT Press.
Dynamic programming (DP) refers to a collection of algorithms
that can be used to compute optimal policies given a perfect model of
the environment as a Markov decision process (MDP). DP is limited in its utility
due to its assumptions and computational requirements, yet it remains theoretically important to understand.
DP utilizes value functions to compute optimal policies. As we shall see, DP algorithms are obtained by
turning Bellman equations into update rules for improving approximations of the desired value functions, $v_*$ or $q_*$.
Bellman Optimality Equations:
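$$ v_*(s) = \max_a \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_*(s') \big] $$

$$ q_*(s, a) = \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma \max_{a'} q_*(s', a') \big] $$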
Policy evaluation entails computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. For all $s \in \mathcal{S}$,
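$$ v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_\pi(s') \big] $$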
where $ \pi(a \mid s) $ is the probability of taking action $a$ in state $s$ under policy $\pi$, and $ p(s' \mid s, a) $ is the probability of transitioning to state $s'$ from $s$ under action $a$.
Assuming the environment's dynamics are completely known,
the solution to this system of linear equations is straightforward. Starting from an arbitrary initial approximation $v_0$, we consider a sequence of successive approximate value functions $v_0, v_1, v_2, \ldots, v_k$, where each approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule. For all $s \in \mathcal{S}$:
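$$ v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_k(s') \big] $$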
The sequence $\{v_k\}$ in general converges to $v_\pi$ as $k \rightarrow \infty$.
Consider a square gridworld environment consisting of $n \times n = n^2$ states; here $n = 4$, giving $16$ states.
The nonterminal states are $\mathcal{S} = \{ 2, 3, \ldots, 15 \}$; the remaining two states ($1$ and $16$, the opposite corners) are terminal. From each state there are four possible actions, $\mathcal{A} = \{\text{up}, \text{down}, \text{right}, \text{left}\}$, each of which deterministically moves the agent to the corresponding neighboring state.
Actions that would take the agent off the grid leave the state unchanged; e.g., from $s = 3$ (in the top row), the action $\text{up}$ leaves the state unchanged.
The reward is $R_t = -1$ on all transitions except those that lead into a terminal state, where $R_t = 0$; i.e., $r(s,a,s') = -1$ for every action $a$ and every pair of states $s, s'$ with $s'$ nonterminal. The agent follows an equiprobable random policy (all four actions equally likely in every state).
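Below is a minimal Python sketch of iterative policy evaluation on this gridworld under the equiprobable random policy. The helper names (`step`, `is_terminal`, `policy_evaluation`), the choice of an undiscounted task ($\gamma = 1$), the in-place update style, and the stopping threshold `theta` are illustrative assumptions rather than part of the original specification.

```python
import numpy as np

N = 4                                           # 4 x 4 grid
GAMMA = 1.0                                     # assuming an undiscounted episodic task
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]    # up, down, right, left

def is_terminal(r, c):
    # Taking states 1 and 16 (the opposite corners) as the terminal states.
    return (r, c) in [(0, 0), (N - 1, N - 1)]

def step(r, c, dr, dc):
    """Deterministic transition; moves that would leave the grid keep the state unchanged."""
    nr, nc = r + dr, c + dc
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    reward = 0.0 if is_terminal(nr, nc) else -1.0   # -1 per step, 0 when entering a terminal state
    return nr, nc, reward

def policy_evaluation(theta=1e-4):
    """Iterative policy evaluation for the equiprobable random policy, using in-place sweeps."""
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if is_terminal(r, c):
                    continue
                v_new = 0.0
                for dr, dc in ACTIONS:               # average over the four equiprobable actions
                    nr, nc, reward = step(r, c, dr, dc)
                    v_new += 0.25 * (reward + GAMMA * V[nr, nc])
                delta = max(delta, abs(v_new - V[r, c]))
                V[r, c] = v_new
        if delta < theta:                            # stop once the largest change is small
            break
    return V

print(policy_evaluation().round(1))
```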
Here is the process of convergence; as we can see, the estimates converge to the structure of $v_\pi$ fairly quickly.
How do you know whether the policy you computed is the optimal policy? For an arbitrary deterministic policy $\pi$ and some state $s$, we would like to know whether we should choose an alternative action $a \neq \pi(s)$. One way to decide is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$; the value of behaving this way is
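$$ q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_\pi(s') \big]. $$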
We want to check whether we can improve the policy by applying the policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
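$$ q_\pi(s, \pi'(s)) \ge v_\pi(s). $$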
Then the policy $\pi'$ must be as good as, or better than, $\pi$: it must obtain greater or equal expected return from every state, $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$.
In particular, consider a changed policy $\pi'$ that is identical to the original deterministic policy $\pi$ except that $\pi'(s) = a \neq \pi(s)$ for a single state $s$. If $q_\pi(s,a) > v_\pi(s)$, then the changed policy is an improvement over $\pi$.
Greedy methods entail selecting, at each state, the action that appears best after one step of lookahead according to $v_\pi$, i.e., the action that maximizes $q_\pi(s,a)$. The new greedy policy $\pi'$ is given, for all $s \in \mathcal{S}$, by
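$$ \pi'(s) = \argmax_a q_\pi(s,a) = \argmax_a \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_\pi(s') \big], $$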
where $\displaystyle\argmax_a$ denotes the value of $a$ at which the expression that follows is maximized. This process is called policy improvement, and it yields a strictly better policy except when the original policy is already optimal: if the greedy policy $\pi'$ is as good as, but not better than, $\pi$, then $v_{\pi'} = v_\pi$ and, for all $s \in \mathcal{S}$,
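$$ v_{\pi'}(s) = \max_a \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_{\pi'}(s') \big], $$

which is exactly the Bellman optimality equation, so both $\pi$ and $\pi'$ must already be optimal.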
We have already computed an initial set of value estimates to obtain an initial policy, $\pi$. Now we can update this policy using the
greedy policy improvement method. Note that if $ q_\pi (s,\pi'(s)) = v_\pi(s)$ for every state, the greedy update leaves the policy unchanged, so a single improvement pass is enough to confirm convergence.
To do this, we will apply our Bellman equation to compute $q_\pi(s,a)$ and select the best action from each state.
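Here is a minimal sketch of this greedy improvement step. It reuses `N`, `GAMMA`, `ACTIONS`, `is_terminal`, `step`, and `policy_evaluation` from the sketch above; `greedy_policy` and `ACTION_NAMES` are illustrative names.

```python
import numpy as np

ACTION_NAMES = ['up', 'down', 'right', 'left']   # same order as ACTIONS

def greedy_policy(V):
    """For each nonterminal state, keep every action tied for the largest q(s, a)."""
    policy = {}
    for r in range(N):
        for c in range(N):
            if is_terminal(r, c):
                continue
            q = []
            for dr, dc in ACTIONS:                   # one-step lookahead for each action
                nr, nc, reward = step(r, c, dr, dc)
                q.append(reward + GAMMA * V[nr, nc])
            best = max(q)
            policy[(r, c)] = [ACTION_NAMES[i] for i, qa in enumerate(q) if np.isclose(qa, best)]
    return policy

V = policy_evaluation()
for state, actions in sorted(greedy_policy(V).items()):
    print(state, actions)
```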
As you can see, we had already converged to the optimal policy: all actions remain the same.
One of the drawbacks of policy iteration is that each policy evaluation itself requires multiple sweeps through the state set, and convergence to $v_\pi$ occurs only in the limit. There must be a way to truncate this process, since in many cases policy evaluation iterations beyond the first few have no effect on the corresponding greedy policy. Value iteration can be viewed as performing just one sweep of policy evaluation under the current greedy policy, after which the greedy policy may change depending on the values obtained in that sweep.
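In update form, each value-iteration sweep applies, for all $s \in \mathcal{S}$,

$$ v_{k+1}(s) = \max_a \sum_{s'} p(s' \mid s, a)\big[ r(s,a,s') + \gamma v_k(s') \big]. $$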
Value iteration converges to its optimum fairly quickly; this is because each sweep effectively combines the value-estimation and policy-improvement steps, rather than performing them in separate sweeps. Here is the value function after $k=3$ iterations:
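A minimal sketch that produces such estimates, reusing the gridworld helpers from above (the sweep count of three and the in-place updating are illustrative choices):

```python
import numpy as np

def value_iteration(num_sweeps=3):
    """Value iteration on the gridworld above, reusing N, GAMMA, ACTIONS, is_terminal, step."""
    V = np.zeros((N, N))
    for k in range(num_sweeps):
        for r in range(N):
            for c in range(N):
                if is_terminal(r, c):
                    continue
                # Greedy backup: max over actions instead of an expectation under the policy.
                V[r, c] = max(
                    reward + GAMMA * V[nr, nc]
                    for nr, nc, reward in (step(r, c, dr, dc) for dr, dc in ACTIONS)
                )
        print(f"after sweep {k + 1}:\n{V.round(1)}\n")
    return V

value_iteration()
```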
A major drawback of these methods is that they require full sweeps over the entire state set of the MDP; if the state set is very large, this can be computationally quite expensive. Asynchronous dynamic programming algorithms are not organized in terms of sweeps over the entire state set. Instead, they update states in any order, using whatever state values happen to be available. One example is to update the states the agent experiences in real time in the MDP, so that only one state $s_k$ is updated on each time step using the value-iteration update; asymptotic convergence to $v_\ast$ then requires that every state appears in the sequence $\{s_k\}$ infinitely often.
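As a sketch of the idea, the following updates a single, randomly chosen state per step, reusing the gridworld helpers above; drawing the states from the agent's actual experience would work the same way.

```python
import random
import numpy as np

def async_value_iteration(num_updates=5000, seed=0):
    """Asynchronous value iteration: back up one state s_k per step."""
    rng = random.Random(seed)
    V = np.zeros((N, N))
    nonterminal = [(r, c) for r in range(N) for c in range(N) if not is_terminal(r, c)]
    for _ in range(num_updates):
        r, c = rng.choice(nonterminal)          # the single state s_k updated at this step
        V[r, c] = max(
            reward + GAMMA * V[nr, nc]
            for nr, nc, reward in (step(r, c, dr, dc) for dr, dc in ACTIONS)
        )
    return V

print(async_value_iteration().round(1))
```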
Dynamic programming is thought to suffer from the curse of dimensionality, because the number of states often grows exponentially with the number of state variables. This is an inherent difficulty of the problem, however, not of DP as a method. For problems with very large state sets, asynchronous methods are often preferred, because synchronous methods must perform sweeps across the entire state set.
In summary, dynamic programming refers to a set of algorithms that can be applied to solve finite Markov decision processes. Policy evaluation and policy improvement can be combined to produce policy iteration and value iteration. The idea of these interacting processes revolving around an approximate policy and an approximate value function is called generalized policy iteration.