SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm used to learn an optimal policy for interacting with an environment. It is a variation of the Q-learning algorithm; the main difference is that SARSA is on-policy: it updates its Q-values using the action actually chosen by the current policy rather than the greedy (maximizing) action.
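To make the difference concrete, the two algorithms differ only in the bootstrap target used in the update, written here in standard notation, where s_t, a_t, and r_{t+1} denote the state, action, and subsequent reward at time t:

Q-learning target: r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')
SARSA target: r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})

where a_{t+1} is the action the current policy actually selects in the next state s_{t+1}.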
Like Q-learning, SARSA uses a Q-table to store the expected return (cumulative discounted reward) for taking a particular action in a given state. At each time step, the algorithm takes an action based on the current state and the current policy, and then updates the Q-values based on the reward received and the next state and action.
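The current policy is usually derived from the Q-table itself. A common choice is an \epsilon-greedy policy, which mostly picks the highest-valued action but occasionally explores. Here is a minimal sketch of such a policy, assuming the Q-table's rows are states and its columns are actions (the epsilon_greedy helper is illustrative and not part of the demo below, which picks actions greedily):

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    # With probability epsilon pick a random action (explore),
    # otherwise pick the highest-valued action for this state (exploit).
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))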
At each step, SARSA updates its Q-values with the following rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

In this equation, \alpha is the learning rate, which determines how much the Q-value is adjusted at each time step, and \gamma is the discount factor, which determines how much future rewards are taken into account when updating the Q-value. The reward r_{t+1} is received after taking action a_t in state s_t, and a_{t+1} is the action the current policy chooses in the next state s_{t+1}.
SARSA can be implemented with a simple loop that follows the agent's states and actions through each episode and updates the Q-values from the rewards and transitions observed. The Q-values can then be used to define the policy, which determines which action to take in each state.
import numpy as np
# Initialize the Q-function with arbitrary values.
Q = np.zeros((5, 5))

# Set the learning rate and discount factor.
alpha = 0.1
gamma = 0.9

# Loop through a fixed number of episodes.
for episode in range(1000):
    # Set the initial state.
    state = 0

    # Choose an action based on the current state and the current policy.
    action = np.argmax(Q[state])

    # Loop until the episode is done.
    while True:
        # Take the action and receive the reward and next state from the environment.
        reward, next_state = env.step(state, action)

        # Choose the next action based on the current policy and the next state.
        next_action = np.argmax(Q[next_state])

        # Update the Q-function.
        Q[state][action] = Q[state][action] + alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])

        # Set the next state and action as the current ones.
        state = next_state
        action = next_action

        # If the episode is done, break the loop.
        if env.is_done(state):
            break
Compared to the Q-learning demo, here we update the Q-function using the SARSA update rule, which uses the action chosen by the current policy (next_action) instead of the maximizing action. We also added an inner loop that keeps stepping through states and actions until the episode is done. Note that because this demo selects actions greedily with np.argmax, the policy never explores; in practice SARSA is usually paired with an exploratory policy such as the \epsilon-greedy one sketched above.
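The demo also assumes an env object with step(state, action) and is_done(state) methods, which are never defined and are not part of any standard library. One hypothetical way to stub it out, so the loop above runs end-to-end, is a small 5-state chain (the class name, dynamics, and rewards are invented for illustration):

class ChainEnv:
    # Hypothetical 5-state chain matching the (5, 5) Q-table above.
    # Action 1 moves one state to the right; any other action stays put.
    # Every step costs a small penalty, and reaching the last state pays +1.
    def __init__(self, n_states=5):
        self.n_states = n_states

    def step(self, state, action):
        next_state = min(state + 1, self.n_states - 1) if action == 1 else state
        reward = 1.0 if next_state == self.n_states - 1 else -0.01
        return reward, next_state

    def is_done(self, state):
        return state == self.n_states - 1

env = ChainEnv()

Defining env like this before the training loop lets the demo run as written; the small step penalty nudges the purely greedy action selection away from actions that merely stay in place. Once training finishes, the learned greedy policy can be read off the table with np.argmax(Q, axis=1).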