- a set of states s ∈ S, a set of actions a ∈ A
- a transition function T(s, a, s′) = P(s′ | s, a)
- T together with the reward function R is the model (the dynamics) of the MDP
Markov: given the present state, the future and the past are independent.
There is a set of states, and a set of actions the agent takes to move among the states.
Taking an action a in a state s may lead to different next states s′, each with probability T(s, a, s′) = P(s′ | s, a).
Transitioning from s to s′ by taking action a yields a reward R(s, a, s′).
A policy π is a mapping from states to actions, π: S → A. To act, one only needs to know which action π(s) to take in each state s.
With a policy π and a starting state, the action taken in every visited state is fixed, producing a sequence of states and rewards. To evaluate the policy, we use the concept of utility: the cumulative reward of that sequence. To focus more on the near future, a discount factor γ ∈ (0, 1) is applied to future rewards: U = Σ_t γ^t r_t.
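The discounted sum above can be sketched in a few lines of Python; the reward sequence here is made up purely for illustration:

```python
# Discounted utility of a reward sequence: U = sum_t gamma^t * r_t.
def discounted_utility(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_utility([1.0, 1.0, 1.0], gamma=0.5))
```

Smaller γ makes the agent more myopic; γ close to 1 values distant rewards almost as much as immediate ones.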
To find an optimal policy, the key is to average the discounted utilities of the successor states with their transition probabilities, then maximize over actions:
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
This is called the Bellman Equation.
To calculate V*(s) by direct expansion, we would need to search the whole expectimax tree.
Policy Iteration:
- Initialize the policy π (e.g. arbitrarily)
- Repeat until the policy stops changing:
    - Policy evaluation: given the policy π, follow it and calculate the probability-averaged utility (the Bellman Equation without the max operator), for every state s:
      V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
    - Policy extraction: once the values of all states are known, update the policy by a one-step look-ahead over all actions, for every state s:
      π(s) ← argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]
- Return the converged policy
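The loop above can be sketched in Python. The two-state MDP (states 0 and 1, actions 'stay'/'go', the specific probabilities and rewards) is invented for illustration, not taken from the notes:

```python
# Tiny hypothetical MDP: T[s][a] = list of (next_state, probability),
# R[s][a] = immediate reward for taking a in s.
T = {
    0: {'stay': [(0, 1.0)], 'go': [(1, 0.8), (0, 0.2)]},
    1: {'stay': [(1, 1.0)], 'go': [(0, 1.0)]},
}
R = {0: {'stay': 0.0, 'go': 1.0}, 1: {'stay': 2.0, 'go': 0.0}}
gamma, states, actions = 0.9, [0, 1], ['stay', 'go']

def q_value(s, a, V):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

def policy_iteration():
    pi = {s: 'stay' for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman equation WITHOUT the max.
        V = {s: 0.0 for s in states}
        for _ in range(200):
            V = {s: q_value(s, pi[s], V) for s in states}
        # Policy extraction: one-step look-ahead with the evaluated values.
        new_pi = {s: max(actions, key=lambda a: q_value(s, a, V))
                  for s in states}
        if new_pi == pi:          # policy stopped changing
            return pi, V
        pi = new_pi
```

On this toy MDP the converged policy is to 'go' from state 0 (chasing the 2-per-step reward available in state 1) and 'stay' in state 1.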
Value Iteration:
- Initialize the values, e.g. V_0(s) = 0
- Repeat until the values stop changing (converge):
    - For every state s, look at the utilities under all actions and take the biggest:
      V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
- For every state s, extract the policy from the converged values:
  π(s) = argmax_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V(s′) ]
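A minimal Python sketch of these steps, on a made-up two-state MDP (states, actions, probabilities, and rewards are all assumptions for illustration):

```python
# Hypothetical MDP: T[s][a] = list of (next_state, probability), R[s][a] = reward.
T = {
    0: {'stay': [(0, 1.0)], 'go': [(1, 0.8), (0, 0.2)]},
    1: {'stay': [(1, 1.0)], 'go': [(0, 1.0)]},
}
R = {0: {'stay': 0.0, 'go': 1.0}, 1: {'stay': 2.0, 'go': 0.0}}
gamma, states, actions = 0.9, [0, 1], ['stay', 'go']

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in states}  # V_0(s) = 0
    while True:
        # Bellman update WITH the max over actions.
        new_V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                        for a in actions)
                 for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < eps:  # converged
            # Extract the policy from the converged values (one-step look-ahead).
            pi = {s: max(actions,
                         key=lambda a: R[s][a] + gamma *
                         sum(p * new_V[s2] for s2, p in T[s][a]))
                  for s in states}
            return new_V, pi
        V = new_V
```

Note that, unlike policy iteration, the policy is extracted only once, after the values have converged.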
The transition function T(s, a, s′) gives the probability P(s′ | s, a) that taking action a in state s leads to state s′.
Stationary preferences: if two reward sequences start with the same reward, preferring one implies preferring its tail; this assumption forces utilities to be additive or discounted-additive.
Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …
Infinite utilities: with an infinite horizon, additive rewards can sum to infinity, making policies incomparable.
Solutions:
- finite horizon (terminate after a fixed number of steps)
- discounting with γ < 1, which bounds the utility by R_max / (1 − γ)
- absorbing (terminal) states that are eventually reached
Deterministic policy: π: S → A, one action per state.
Stochastic policy: π(a | s), a distribution over actions for each state.
An optimal policy π*: maximizes the expected total discounted reward.
Policy extraction: π*(s) = argmax_a Q*(s, a)
Q-value Q(s, a): the value of taking action a in state s (and acting optimally thereafter).
Value V(s): the value of the state s, i.e. V(s) = max_a Q(s, a).
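The relationship between Q-values, state values, and the extracted policy can be sketched with a small hypothetical Q-table (the states, actions, and numbers below are invented):

```python
# Hypothetical Q-table: Q[(state, action)] = value of taking action in state.
Q = {('s0', 'left'): 1.0, ('s0', 'right'): 3.0,
     ('s1', 'left'): 2.5, ('s1', 'right'): 0.5}

def state_value(s):
    # V(s) = max_a Q(s, a)
    return max(q for (state, _), q in Q.items() if state == s)

def greedy_action(s):
    # pi(s) = argmax_a Q(s, a)
    return max((a for (state, a) in Q if state == s), key=lambda a: Q[(s, a)])
```

With Q-values in hand, policy extraction is trivial; with only V, extracting a policy requires a one-step look-ahead through the transition model.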
Finding the policy:
The optimal policy π* specifies which action to take in every possible state so as to maximize the cumulative reward over time.
Future rewards are less certain than immediate ones, so use a discount factor γ to prioritize sooner gains.
Bellman Equation:
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
where V* gives the values of the states.
Racing Search Tree
Time-limited values: define V_k(s) to be the optimal value of s if the game ends in k more steps.
Value Iteration algorithm:
V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
Each such update is effectively a one-step expectimax computation.
- Policy evaluation: calculate the utilities of some fixed policy until they converge
- Policy improvement: update the policy using a one-step look-ahead, with the converged utilities as future values
- Repeat until the policy converges
Value iteration, policy evaluation, and policy extraction (one-step look-ahead) are all variations of the Bellman update.