
2. MDP

Really useful link for learning RL and Game theory

Markov Decision Process

The Bellman Equation

$$V(s) = \max_{a \in A(s)} \sum_{s' \in S} P_a(s'|s) \left[ r(s, a, s') + \gamma V(s') \right]$$
  • $\gamma$: discount factor
  • $P_a(s'|s)$: transition probability
  • $r(s, a, s')$: reward
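
As a minimal sketch of how this backup drives value iteration (assuming a small discrete MDP described by a hypothetical dictionary `P[s][a]` of `(prob, next_state, reward)` tuples, where every reachable next state is also a key of `P`):

```python
# Minimal value-iteration sketch for a small discrete MDP.
# Assumed (hypothetical) layout: P[s][a] = [(prob, next_state, reward), ...]

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                      # initialise V(s) = 0
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: best expected one-step reward plus discounted value
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # values have converged
            return V
```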

Policy Extraction: deciding how to act

$$\pi(s) = \arg\max_{a \in A(s)} \sum_{s' \in S} P_a(s'|s) \left[ r(s, a, s') + \gamma V(s') \right]$$

(extract the optimal action)
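
Given a value function $V$ computed as above, the greedy policy can be extracted in a single pass over the states. A sketch, reusing the hypothetical `P[s][a]` layout from the value-iteration sketch:

```python
def extract_policy(P, V, gamma=0.9):
    # For each state, choose the action with the highest one-step lookahead value.
    return {
        s: max(
            P[s],
            key=lambda a, s=s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]),
        )
        for s in P
    }
```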

The multi-armed bandit problem

A multi-armed bandit (also known as an $N$-armed bandit) is defined by a set of random variables $X_{i,k}$ where:

  • $1 \leq i \leq N$ such that $i$ is the arm of the bandit; and
  • $k$ is the index of the play of arm $i$.

The idea is that a gambler iteratively plays rounds, observing the reward from the arm after each round, and can adjust their strategy each time. The aim is to maximize the sum of the rewards collected over all rounds.

$\epsilon$-greedy strategy

  • With probability $1-\epsilon$ we choose the arm with maximum Q-value, $\arg\max_a Q(a)$ (exploit).
  • With probability $\epsilon$ we choose a random arm with uniform probability (explore).
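
A minimal sketch of $\epsilon$-greedy arm selection, assuming `Q` is a dict mapping each arm to its current value estimate (a hypothetical layout):

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known arm.
    if random.random() < epsilon:
        return random.choice(list(Q))        # explore: uniform random arm
    return max(Q, key=Q.get)                 # exploit: arm with maximum Q-value
```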

Softmax strategy

$$\frac{e^{Q(s, a) / \tau}}{\sum_{b=1}^{n} e^{Q(s, b) / \tau}}$$

Softmax is a probability matching strategy, which means that the probability of each action being chosen is dependent on its Q-value so far.
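
A sketch of softmax action selection with temperature $\tau$, under the same assumption that `Q` maps arms to value estimates; higher $\tau$ makes the choice more uniform, lower $\tau$ more greedy:

```python
import math
import random

def softmax_select(Q, tau=1.0):
    arms = list(Q)
    # Subtract the max Q-value before exponentiating for numerical stability.
    m = max(Q.values())
    weights = [math.exp((Q[a] - m) / tau) for a in arms]
    # Sample one arm with probability proportional to its exponentiated Q-value.
    return random.choices(arms, weights=weights, k=1)[0]
```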

UCB1 (upper confidence bounds)

Using the UCB1 strategy, we select the next action using the following:

$$\arg\max_a \left( Q(a) + \sqrt{\frac{2 \ln t}{N(a)}} \right)$$

where $t$ is the number of rounds so far, and $N(a)$ is the number of times $a$ has been chosen in all previous rounds.

  • The left term encourages exploitation: the Q-value is high for actions that have had a high reward.
  • The right term encourages exploration: it is high for actions that have been explored less, that is, when $N(a)$ is small relative to other actions. As $t$ increases, if some actions have low $N(a)$, then $\sqrt{\frac{2 \ln t}{N(a)}}$ is large compared to that of actions with higher $N(a)$.
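
A sketch of UCB1 selection, assuming `Q` holds value estimates, `N` holds play counts, and `t` is the current round; each arm is played once before the formula is applied, so $N(a)$ is never zero in the bonus term:

```python
import math

def ucb1_select(Q, N, t):
    # Play every arm once first, so N(a) is never zero in the bonus term.
    for a in Q:
        if N[a] == 0:
            return a
    # Exploitation term Q(a) plus exploration bonus sqrt(2 ln t / N(a)).
    return max(Q, key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))
```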

Model-based vs model-free

Value iteration is part of a class of solutions known as model-based techniques. This means that we need to know the model; in particular, we have access to $P_a(s'|s)$ and $r(s, a, s')$.

How can we calculate a policy if we don't know the transitions and rewards? We learn through experience by trying actions and seeing what the result is, making this a machine learning problem. We learn a value function or a policy directly.

Temporal-difference learning

Model-free reinforcement learning is learning a policy directly from experience and rewards. Q-learning and SARSA are two model-free approaches.

$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma V(s') - Q(s, a)]$$
  • $\alpha$: learning rate
  • $r$: reward
  • $\gamma$: discount factor
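
As a sketch, the tabular update is a single line once the estimate of $V(s')$ has been chosen (here `Q` is assumed to be a `defaultdict` keyed by `(state, action)` pairs):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-values default to 0.0 for unseen (state, action) pairs

def td_update(Q, s, a, r, v_next, alpha=0.1, gamma=0.9):
    # Move Q(s, a) towards the TD target r + gamma * V(s').
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])
```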

Q-learning: Off-policy temporal difference learning

$$V(s') \leftarrow \max_{a'} Q(s', a')$$

SARSA: On-policy temporal difference learning

$$V(s') \leftarrow Q^{\pi}(s', a')$$

Q-learning vs SARSA

  • Q-learning
    • Q-learning will converge to the optimal policy.
    • But it can be 'unsafe' or risky during training.
  • SARSA
    • SARSA learns a safer policy, though it may not be optimal.
    • SARSA receives a higher average reward during training.

The difference between SARSA and Q-learning is that Q-learning is off-policy learning, while SARSA is on-policy learning. Essentially, this means that SARSA chooses its action using the same policy used to choose the previous action, and then uses this difference to update its Q-function; while Q-learning updates assuming that the next action will be the action with the maximum Q-value.

Q-learning is therefore "optimistic", in that when it updates, it assumes that in the next state the greedy action will be chosen, even though the policy, such as $\epsilon$-greedy, may choose to explore an action other than the best in the next step.

SARSA instead knows the action it will execute next when it performs the update, so it learns from that action whether or not it is the best.
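
In code, the difference shows up only in how the TD target is formed. A sketch under the same `(state, action)`-keyed `Q` assumption as before:

```python
# Q-learning target: assume the greedy action is taken in s' (off-policy).
def q_learning_target(Q, r, s_next, actions, gamma=0.9):
    return r + gamma * max(Q[(s_next, a)] for a in actions)

# SARSA target: use the action a_next actually chosen by the behaviour policy (on-policy).
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    return r + gamma * Q[(s_next, a_next)]
```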

On-policy vs. off-policy: Why do we have both?

The main advantage of off-policy approaches is that they can use samples from sources other than their own policy. For example, off-policy agents can be given a set of episodes of behaviour from another agent, such as a human expert, and can learn a policy by demonstration.

n-step SARSA

$$G_t^n = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \dots + \gamma^n \cdot Q(s', a')$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [G_t^n - Q(s_t, a_t)]$$
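
A sketch of computing the n-step return with this indexing, assuming `rewards` is the stored window $[r_t, r_{t+1}, \dots]$ and `bootstrap_q` is $Q(s', a')$ at the end of the window:

```python
def n_step_return(rewards, bootstrap_q, gamma=0.9):
    # rewards = [r_t, r_{t+1}, ..., r_{t+n-1}]; bootstrap with gamma^n * Q(s', a').
    n = len(rewards)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** n * bootstrap_q

def n_step_update(Q, s_t, a_t, G, alpha=0.1):
    # Move Q(s_t, a_t) towards the n-step return G_t^n.
    Q[(s_t, a_t)] += alpha * (G - Q[(s_t, a_t)])
```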

Reward shaping

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \underbrace{F(s, s')}_{\text{additional reward}} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The purpose of the function is to give an additional reward $F(s, s')$ when any action transitions from state $s$ to $s'$. The function $F$ provides heuristic domain knowledge to the problem and is typically manually programmed.

Potential-based Reward Shaping

$$F(s, s') = \gamma \Phi(s') - \Phi(s)$$

$\Phi(s)$ is the potential of state $s$.
Important: the discount factor $\gamma$ is applied to $\Phi(s')$.
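
A sketch of plugging a potential function into the shaped update above, where `potential` is any user-supplied heuristic $\Phi$ (for example, negative distance to the goal) and `Q` is keyed by `(state, action)` pairs:

```python
def shaping_bonus(potential, s, s_next, gamma=0.9):
    # F(s, s') = gamma * Phi(s') - Phi(s)
    return gamma * potential(s_next) - potential(s)

def shaped_q_update(Q, s, a, r, s_next, actions, potential, alpha=0.1, gamma=0.9):
    F = shaping_bonus(potential, s, s_next, gamma)
    # Shaped Q-learning target: reward plus bonus plus discounted greedy next value.
    target = r + F + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```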

Linear Q-function Approximation

The key idea is to approximate the Q-function using a linear combination of features and their weights, instead of recording a separate value for every state-action pair.

The overall process:

  1. For the states, consider what features determine their representation;
  2. during learning, perform updates based on the weights of features instead of states; and
  3. estimate $Q(s, a)$ by summing the features and their weights.

Representation

  1. A feature vector, $f(s, a)$, which is a vector of $n \cdot |A|$ different features, where $n$ is the number of features and $|A|$ is the number of actions.

It is straightforward to construct $n \times |A|$ state-action pair features from just $n$ state features:

$$f_{i,k}(s, a) = \begin{cases} f_i(s) & \text{if } a = a_k \\ 0 & \text{otherwise} \end{cases} \qquad 1 \leq i \leq n, \; 1 \leq k \leq |A|$$

This effectively results in $|A|$ different feature vectors:

$$f(s, a_1) = \begin{pmatrix} f_{1,a_1}(s,a) \\ f_{2,a_1}(s,a) \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{pmatrix} \quad f(s, a_2) = \begin{pmatrix} 0 \\ 0 \\ f_{1,a_2}(s,a) \\ f_{2,a_2}(s,a) \\ 0 \\ 0 \\ \vdots \end{pmatrix} \quad f(s, a_3) = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ f_{1,a_3}(s,a) \\ f_{2,a_3}(s,a) \\ \vdots \end{pmatrix} \quad \ldots$$
  2. A weight vector $w$ of size $n \times |A|$: one weight for each feature-action pair. $w_i^a$ defines the weight of feature $i$ for action $a$.

Q-values from linear Q-function

Given a feature vector $f$ and a weight vector $w$, the Q-value of a state-action pair is a linear combination of features and weights.

$$Q(s, a) = f_1(s, a) \cdot w_1^a + f_2(s, a) \cdot w_2^a + \ldots + f_n(s, a) \cdot w_n^a = \sum_{i=1}^{n} f_i(s, a) \, w_i^a$$
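
A sketch of this dot product, keeping one weight list per action (an equivalent layout to the single concatenated vector), where `features(s)` is assumed to return the list $[f_1(s), \dots, f_n(s)]$:

```python
def q_value(features, weights, s, a):
    # features(s) -> [f_1(s), ..., f_n(s)]; weights[a] is the weight vector for action a.
    return sum(f_i * w_i for f_i, w_i in zip(features(s), weights[a]))
```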

Linear Q-function update

$$w_i^a \leftarrow w_i^a + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)$$

$\max_{a'} Q(s', a')$ can be altered depending on the algorithm used.

  • An action only updates the weights corresponding to its own features (see the sketch below).
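
A sketch of the weight update using the Q-learning target; only the weights of the executed action are touched, scaled by that action's features:

```python
def linear_q_update(weights, features, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.9):
    f = features(s)
    q_sa = sum(fi * wi for fi, wi in zip(f, weights[a]))
    # Q-learning target: greedy value of the next state under the current weights.
    q_next = max(
        sum(fi * wi for fi, wi in zip(features(s_next), weights[a2]))
        for a2 in actions
    )
    delta = r + gamma * q_next - q_sa
    # Only the weights of the executed action a are updated, scaled by its features.
    weights[a] = [wi + alpha * delta * fi for wi, fi in zip(weights[a], f)]
```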

Value update for an expectimax tree

$$V(t) = \max_{a' \in \{c, g\}} \sum_{t' \in \text{children}(t)} P_{a'}(t' \mid t) \left[ r(t, a', t') + \gamma V(s(t')) \right]$$