- a set of states s ∈ S, a set of actions a ∈ A
- a transition function T(s, a, s′) = P(s′ | s, a)
- T together with the reward function R is the model (the dynamics) of the MDP
Markov: given the present state, the future and the past are independent.
There is a set of states, and a set of actions the agent takes to move among the states.
Taking an action a in a state s may lead to different next states s′, each with probability T(s, a, s′) = P(s′ | s, a).
Transitioning from s to s′ by taking action a yields a reward R(s, a, s′).
A policy π is a mapping from states to actions, π: S → A. To act, one only needs to know which action π(s) to take in each state s.
With a policy π and a starting state, the action taken in every visited state is fixed, producing a sequence of states and rewards. To evaluate the policy, we use the concept of utility: the cumulative reward of that sequence. To focus more on the near future, a discount factor γ ∈ (0, 1) is applied to future rewards: U = Σ_t γ^t r_t.
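The discounted sum above can be sketched in a few lines of Python; the reward sequence here is made up purely for illustration:

```python
# Discounted utility of a reward sequence: U = sum_t gamma^t * r_t.
def discounted_utility(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_utility([1.0, 1.0, 1.0], gamma=0.5))
```

Smaller γ makes the agent more myopic; γ close to 1 values distant rewards almost as much as immediate ones.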
To find an optimal policy, the key is to average the discounted utilities of the successor states with their transition probabilities, then maximize over actions:
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
This is called the Bellman Equation.
To calculate V*(s) by direct expansion, we would need to search the whole expectimax tree.
Policy Iteration:
- Initialize the policy π (e.g. arbitrarily)
- Repeat until the policy stops changing:
    - Policy evaluation: given the policy π, follow it and calculate the probability-averaged utility (the Bellman Equation without the max operator), for every state s:
      V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
    - Policy extraction: once the values of all states are known, update the policy by a one-step look-ahead over all actions, for every state s:
      π(s) ← argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]
- Return the converged policy
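The loop above can be sketched in Python. The two-state MDP (states 0 and 1, actions 'stay'/'go', the specific probabilities and rewards) is invented for illustration, not taken from the notes:

```python
# Tiny hypothetical MDP: T[s][a] = list of (next_state, probability),
# R[s][a] = immediate reward for taking a in s.
T = {
    0: {'stay': [(0, 1.0)], 'go': [(1, 0.8), (0, 0.2)]},
    1: {'stay': [(1, 1.0)], 'go': [(0, 1.0)]},
}
R = {0: {'stay': 0.0, 'go': 1.0}, 1: {'stay': 2.0, 'go': 0.0}}
gamma, states, actions = 0.9, [0, 1], ['stay', 'go']

def q_value(s, a, V):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

def policy_iteration():
    pi = {s: 'stay' for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman equation WITHOUT the max.
        V = {s: 0.0 for s in states}
        for _ in range(200):
            V = {s: q_value(s, pi[s], V) for s in states}
        # Policy extraction: one-step look-ahead with the evaluated values.
        new_pi = {s: max(actions, key=lambda a: q_value(s, a, V))
                  for s in states}
        if new_pi == pi:          # policy stopped changing
            return pi, V
        pi = new_pi
```

On this toy MDP the converged policy is to 'go' from state 0 (chasing the 2-per-step reward available in state 1) and 'stay' in state 1.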
Value Iteration:
- Initialize the values, e.g. V_0(s) = 0
- Repeat until the values stop changing (converge):
    - For every state s, look at the utilities under all actions and take the biggest:
      V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
- For every state s, extract the policy from the converged values:
  π(s) = argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V(s′) ]
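A minimal Python sketch of these steps, on a made-up two-state MDP (states, actions, probabilities, and rewards are all assumptions for illustration):

```python
# Hypothetical MDP: T[s][a] = list of (next_state, probability), R[s][a] = reward.
T = {
    0: {'stay': [(0, 1.0)], 'go': [(1, 0.8), (0, 0.2)]},
    1: {'stay': [(1, 1.0)], 'go': [(0, 1.0)]},
}
R = {0: {'stay': 0.0, 'go': 1.0}, 1: {'stay': 2.0, 'go': 0.0}}
gamma, states, actions = 0.9, [0, 1], ['stay', 'go']

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in states}  # V_0(s) = 0
    while True:
        # Bellman update WITH the max over actions.
        new_V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                        for a in actions)
                 for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < eps:  # converged
            # Extract the policy from the converged values (one-step look-ahead).
            pi = {s: max(actions,
                         key=lambda a: R[s][a] + gamma *
                         sum(p * new_V[s2] for s2, p in T[s][a]))
                  for s in states}
            return new_V, pi
        V = new_V
```

Note that, unlike policy iteration, the policy is extracted only once, after the values have converged.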
The transition function T(s, a, s′) gives the probability P(s′ | s, a) that taking action a in state s leads to state s′.
Stationary preferences: if two reward sequences start with the same reward, preferring one implies preferring its tail; this assumption forces utilities to be additive or discounted-additive.
Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …
Infinite utilities: with an infinite horizon, additive rewards can sum to infinity, making policies incomparable.
Solutions:
- finite horizon (terminate after a fixed number of steps)
- discounting with γ < 1, which bounds the utility by R_max / (1 − γ)
- absorbing (terminal) states that are eventually reached
Deterministic policy: π: S → A, one action per state.
Stochastic policy: π(a | s), a distribution over actions for each state.
An optimal policy π*: maximizes the expected total discounted reward.
Policy extraction: π*(s) = argmax_a Q*(s, a)
Q-value Q(s, a): the value of taking action a in state s (and acting optimally thereafter).
Value V(s): the value of the state s, i.e. V(s) = max_a Q(s, a).
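The relationship between Q-values, state values, and the extracted policy can be sketched with a small hypothetical Q-table (the states, actions, and numbers below are invented):

```python
# Hypothetical Q-table: Q[(state, action)] = value of taking action in state.
Q = {('s0', 'left'): 1.0, ('s0', 'right'): 3.0,
     ('s1', 'left'): 2.5, ('s1', 'right'): 0.5}

def state_value(s):
    # V(s) = max_a Q(s, a)
    return max(q for (state, _), q in Q.items() if state == s)

def greedy_action(s):
    # pi(s) = argmax_a Q(s, a)
    return max((a for (state, a) in Q if state == s), key=lambda a: Q[(s, a)])
```

With Q-values in hand, policy extraction is trivial; with only V, extracting a policy requires a one-step look-ahead through the transition model.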
Finding the policy:
The optimal policy π* specifies which action to take in every possible state so as to maximize the cumulative reward over time.
Future rewards are less certain than immediate ones, so use a discount factor γ to prioritize sooner gains.
Bellman Equation:
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
where V* gives the values of the states.
Racing Search Tree
Time-limited values: define V_k(s) to be the optimal value of s if the game ends in k more steps.
Value Iteration algorithm:
V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
Each such update is effectively a one-step expectimax computation.
- Policy evaluation: calculate the utilities of some fixed policy until they converge
- Policy improvement: update the policy using a one-step look-ahead, with the converged utilities as future values
- Repeat until the policy converges
Value iteration, policy evaluation, and policy extraction (one-step look-ahead) are all variations of the Bellman update.