• a set of states S, a set of actions A
  • transition function T(s, a, s')
    • the model, the dynamics

Markov: given the present state, the future and the past are independent.

There is a set of states S to move among, and a set of actions A that can be taken to transfer between states.

Taking an action a in a specific state s might lead to different next states s', each with probability T(s, a, s').

Transitioning from s to s' by taking action a is assigned a reward R(s, a, s').

A policy π is a mapping from states to actions: to act, one needs to know which action π(s) to take in each state s.

With a policy π and a given starting state, following π produces a state sequence s_0, s_1, s_2, .... To evaluate the policy, we use the concept of utility: the accumulated rewards along the sequence. To focus more on the near future, use a discount factor γ to discount future rewards.

To find a policy, the key is to average the discounted utilities of the successor states s', weighted by the transition probabilities:

V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

This is called the Bellman Equation.

To calculate V*(s) directly this way, we would need to search the whole expectimax tree.

Policy Iteration:

  • Initialize the policy π
  • Repeat until the policy stops changing
    • for every state s:
      • Policy evaluation: given the current policy π, follow it and calculate the probability-averaged utility V^π(s) (the Bellman Equation without the max operator): update the state values according to the current policy.
      • Policy extraction: once you have the values of all states, update the policy with a one-step lookahead, taking the argmax over actions (no further value iteration needed): extract the policy from the converged values above.
  • return the converged policy
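The loop above can be sketched on a tiny made-up two-state MDP (the state names, transitions, and rewards here are invented for illustration; T[s][a] maps to a list of (next_state, probability, reward) triples):

```python
GAMMA = 0.9

# Hypothetical dynamics: T[s][a] = list of (next_state, probability, reward)
T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("warm", 1.0, -10.0)]},
}

def evaluate(policy, gamma=GAMMA, eps=1e-9):
    """Policy evaluation: Bellman updates WITHOUT the max operator."""
    V = {s: 0.0 for s in T}
    while True:
        new_V = {s: sum(p * (r + gamma * V[s2])
                        for s2, p, r in T[s][policy[s]]) for s in T}
        if max(abs(new_V[s] - V[s]) for s in T) < eps:
            return new_V
        V = new_V

def policy_iteration(gamma=GAMMA):
    policy = {s: next(iter(T[s])) for s in T}   # arbitrary initial policy
    while True:
        V = evaluate(policy, gamma)
        # Policy extraction: one-step lookahead over all actions
        new_policy = {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                                     for s2, p, r in T[s][a]))
                      for s in T}
        if new_policy == policy:                # policy stopped changing
            return policy, V
        policy = new_policy
```

Note that each outer iteration fully evaluates the current policy before improving it, so the policy (not the values) is what must converge.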

Value Iteration:

  • Initialize the values V(s) (e.g., to zero)
  • Repeat until the values stop changing (converge)
    • for every state s:
      • look at the utilities of all the actions, and take the biggest utility: V(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V(s') ]
  • Extract the policy from the converged values.
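A minimal sketch of this procedure on a tiny made-up two-state MDP (state names, transitions, and rewards are invented for illustration; T[s][a] maps to a list of (next_state, probability, reward) triples):

```python
GAMMA = 0.9

# Hypothetical dynamics: T[s][a] = list of (next_state, probability, reward)
T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("warm", 1.0, -10.0)]},
}

def value_iteration(gamma=GAMMA, eps=1e-9):
    V = {s: 0.0 for s in T}                      # initialize values
    while True:
        # Bellman update: V(s) <- max_a sum_s' T(s,a,s') [R + gamma * V(s')]
        new_V = {s: max(sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                        for outcomes in T[s].values())
                 for s in T}
        if max(abs(new_V[s] - V[s]) for s in T) < eps:   # converged
            return new_V
        V = new_V

V = value_iteration()
# Extract the policy from the converged values (one-step lookahead)
policy = {s: max(T[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                         for s2, p, r in T[s][a]))
          for s in T}
```

Unlike policy iteration, no explicit policy is kept during the loop; the policy appears only in the final extraction step.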

The transition function T(s, a, s') gives the probability that taking action a in state s leads to each next state s'.

Stationary preferences: if preferences over reward sequences are stationary, the utility must be additive or discounted.

Discounted utility: U = r_0 + γ r_1 + γ^2 r_2 + ...

Infinite utilities: with an infinite horizon and additive rewards, utilities can grow without bound.

Solutions: use a finite horizon, use discounting with 0 < γ < 1 (the utility is then bounded by R_max / (1 − γ)), or guarantee that an absorbing/terminal state is reached.
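A quick numeric check of the discounting bound (γ and R_max are chosen arbitrarily for illustration):

```python
# With 0 < gamma < 1 and rewards bounded by R_max, the discounted utility
# sum_t gamma^t * r_t is bounded by R_max / (1 - gamma).
gamma, R_max = 0.9, 1.0
utility = sum(gamma**t * R_max for t in range(1000))  # truncated infinite sum
bound = R_max / (1 - gamma)                           # = 10.0 here
```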

Deterministic Policy: π(s) = a, one fixed action per state.

Stochastic Policy: π(a | s), a probability distribution over actions for each state.

An optimal policy π*: maximizes the expected total discounted reward.

Policy Extraction: π(s) = argmax_a Q(s, a)

Q-value Q(s, a): the value of taking action a in state s

Value V(s): the value of the state s
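In code, using a hypothetical representation T[s][a] = [(next_state, probability, reward), ...] on a made-up two-state MDP, with state values assumed to have been computed already (e.g. by value iteration):

```python
GAMMA = 0.9

# Hypothetical dynamics: T[s][a] = list of (next_state, probability, reward)
T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("warm", 1.0, -10.0)]},
}

# State values assumed already computed for this toy MDP
V = {"cool": 15.5, "warm": 14.5}

def q_value(s, a, gamma=GAMMA):
    # Q(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + gamma * V(s')]
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])

def extract_policy(gamma=GAMMA):
    # pi(s) = argmax_a Q(s, a): a single one-step lookahead, no iteration
    return {s: max(T[s], key=lambda a: q_value(s, a, gamma)) for s in T}
```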

Finding the policy:

The optimal policy π* specifies which action to take in every possible state to maximize the cumulative reward over time.

Because future rewards are less certain than immediate ones, use a discount factor γ to prioritize sooner gains.

Bellman Equation: V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ], where V*(s) is the optimal value of state s.

Racing Search Tree

Time-limited values: define V_k(s) to be the optimal value of s if the game ends in k more time steps.

Value Iteration algorithm: start with V_0(s) = 0, then repeatedly apply V_{k+1}(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_k(s') ].

Each such update is one layer of expectimax: a max over actions of an expectation over outcomes.

  • Policy evaluation: calculate utilities for some fixed policy until they converge
  • Policy improvement: update the policy using a one-step lookahead with the resulting converged utilities as future values
  • repeat until the policy converges

Value iteration, policy iteration (policy evaluation plus policy improvement), and policy extraction (one-step lookahead) are all variations of Bellman updates.