Reinforcement Learning
RL is a framework for solving decision problems by building agents that learn from an environment through trial-and-error interaction, with rewards (positive or negative) as their only feedback
at each timestep t:
the agent receives state S_t from the environment
the agent takes action A_t
the environment transitions to the next state S_t+1
the environment provides a reward R_t+1
this forms a loop of transitions: (s_t, a_t, r_t+1, s_t+1)
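A minimal sketch of this loop, assuming a made-up toy environment (`CoinFlipEnv`, its `reset`/`step` methods, and the 5-step episode length are all hypothetical names and choices for illustration, not part of these notes):

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, 5 steps per episode."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state (here just the step counter)

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else -1.0  # R_{t+1}
        self.t += 1
        next_state = self.t                       # S_{t+1}
        done = self.t >= 5
        return next_state, reward, done

def run_episode(env, policy):
    """Collect (s_t, a_t, r_{t+1}, s_{t+1}) tuples for one episode."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # agent takes A_t
        next_state, reward, done = env.step(action)  # environment responds
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

# A trivial policy that always guesses 0.
transitions = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

Each tuple in `transitions` is one pass through the loop; the `next_state` of one tuple is the `state` of the following one.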
Reward hypothesis: every goal can be described as maximizing the expected cumulative reward (aka expected return), so that is what the agent tries to maximize
Markov decision process (MDP): our agent only needs the current state to decide what action to take, not the history of all states and actions that came before (the Markov property). The next state depends only on the current state and action:
$P(s_{t+1} \mid s_t, a_t)$
state S: a complete description of the state of the world
action space: the set of all possible actions in an environment
discrete space: the number of possible actions is finite
continuous space: the number of possible actions is infinite
the cumulative return is the sum of the rewards that follow timestep t:
R_t = r_t+1 + r_t+2 + r_t+3 + …
R = cumulative return
future rewards are uncertain, so we discount them
how to discount rewards?
apply a discount factor γ ∈ [0, 1] to the rewards: a reward k steps further into the future is multiplied by γ^k; γ is typically 0.95-0.99
high γ → agent cares about long term rewards
low γ → agent cares about short term rewards
discounted return (its expected value is what the agent maximizes)
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
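To make the formula concrete, here is a small sketch that computes $G_t$ from a finite list of rewards (the function name `discounted_return` and the example rewards are assumptions for illustration). It works backwards using $G_t = R_{t+1} + \gamma G_{t+1}$, which is the same sum rearranged:

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=1.0))  # no discounting: 3.0
```

With a low γ the later rewards barely count, which is the "short term" behaviour described above; with γ = 1 every reward counts equally.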
task: an instance of a reinforcement learning problem. types:
The goal in reinforcement learning is to build an agent that selects actions which maximize the expected cumulative (discounted) reward. For this, the agent must learn an optimal policy
Policy $\pi$: the agent's brain, i.e. the rule that tells the agent what action to take when it is in a given state
analogy: a government policy → what action to take under certain conditions
two approaches to learn an optimal policy:
types of value functions:
a value function is a way to measure “how good it is to be in a state” or “how good it is to take a certain action in a state”
in both cases the policy is derived from the value function; it is not learned directly
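One way to read "the policy is derived from the value" is a greedy policy: in each state, pick the action with the highest estimated value. A minimal sketch, where the state names, action names, and numbers are all made up for illustration:

```python
def greedy_policy(q_values, state):
    """Pick the action with the highest estimated value in the given state."""
    action_values = q_values[state]
    return max(action_values, key=action_values.get)

# Hypothetical action values for two states.
q_values = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.1},
}

print(greedy_policy(q_values, "s0"))  # right
print(greedy_policy(q_values, "s1"))  # left
```

Nothing here stores a policy explicitly; changing the values changes the behaviour.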
some strategies to train our value/policy function
Q-learning: off-policy, value-based RL
goal is to find the optimal policy indirectly by learning a Q function
does a TD (temporal difference) update after each step
has a Q function Q(s, a): the quality of taking action a in state s (Q stands for quality)
Internally stored as a Q-table
this works well only for small discrete state spaces; it does not scale to continuous or large state spaces
also, the Q-value of every cell in the table has to be calculated beforehand
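A sketch of a single Q-table update, assuming the standard Q-learning rule $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. The learning rate `alpha`, the `done` flag, and the dict-based table are assumptions for illustration; the notes do not specify them:

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one TD update to Q(state, action) and return the new value."""
    # Value of the best action in the next state (0 if the episode ended).
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]

# Q-table as a dict keyed by (state, action); unseen cells default to 0.0.
q_table = defaultdict(float)
actions = [0, 1]
new_q = q_learning_update(q_table, state=0, action=1, reward=1.0,
                          next_state=1, actions=actions)
```

The table has one cell per (state, action) pair, which is why it only stays manageable for small discrete state spaces.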
off-policy: using a different policy for acting (inference) than for updating (training)
on-policy: using the same policy for acting and updating
acting policy: which action do I take in the real environment at this moment
updating policy: when updating Q-values, which future action do I assume I'll take
the assumed future action could be an exploratory one or the one with the optimal (highest) value
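The difference shows up in how the update target is built. In this sketch, epsilon-greedy is assumed as the acting policy, and the on-policy target follows the SARSA rule; neither name appears in these notes, so treat both as illustrative choices:

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Acting policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def off_policy_target(reward, next_q_row, gamma):
    """Q-learning style: assume the greedy action is taken in the next state."""
    return reward + gamma * max(next_q_row)

def on_policy_target(reward, next_q_row, next_action, gamma):
    """SARSA style: use the action the acting policy actually picked next."""
    return reward + gamma * next_q_row[next_action]

rng = random.Random(0)
next_q_row = [0.0, 2.0]  # hypothetical Q-values of the next state
next_action = epsilon_greedy(next_q_row, epsilon=0.1, rng=rng)

print(off_policy_target(1.0, next_q_row, gamma=0.5))  # always uses max: 2.0
print(on_policy_target(1.0, next_q_row, next_action, gamma=0.5))
```

When the acting policy happens to explore, the on-policy target reflects that exploratory action, while the off-policy target ignores it.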
Deep Q learning (DQN)
use a deep neural network to approximate the Q values Q(s, a)
input: state
output: Q values for all possible actions
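To show the input/output shape, here is a tiny stand-in for such a network written in plain Python (one hidden layer with ReLU; the layer sizes, the weight initialization, and the names `make_q_network` and `forward` are assumptions for illustration). A real DQN would use a deep learning library and a deeper network:

```python
import random

def make_q_network(state_dim, hidden_dim, n_actions, seed=0):
    """Build a small two-layer network as a list of weight/bias dicts."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        return {
            "w": [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)],
            "b": [0.0] * n_out,
        }

    return [layer(state_dim, hidden_dim), layer(hidden_dim, n_actions)]

def forward(network, state):
    """Map a state vector to one Q-value per action."""
    x = list(state)
    for i, layer in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(layer["w"], layer["b"])]
        if i < len(network) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x

network = make_q_network(state_dim=4, hidden_dim=8, n_actions=2)
q_values = forward(network, [0.1, 0.2, 0.3, 0.4])
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
```

One forward pass gives the Q-value of every action at once, so picking the greedy action is a single `max` over the output.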
loss function in DQN
$\mathcal{L} = \left(Q_{\text{target}} - Q_\theta(s, a)\right)^2$
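A sketch of this loss for a single transition. The notes do not define $Q_{\text{target}}$; the code assumes the usual TD target $r + \gamma \max_{a'} Q(s', a')$, and the function name, the `done` flag, and the numbers are made up for illustration:

```python
def dqn_loss(q_pred_row, action, reward, next_q_row, gamma=0.99, done=False):
    """Squared difference between the TD target and the predicted Q(s, a)."""
    # Assumed target: reward plus discounted best next-state value
    # (just the reward if the episode ended at this step).
    q_target = reward if done else reward + gamma * max(next_q_row)
    return (q_target - q_pred_row[action]) ** 2

# Predicted Q-values for state s and for next state s' (hypothetical numbers).
loss = dqn_loss(q_pred_row=[0.5, 1.0], action=1, reward=1.0,
                next_q_row=[0.0, 2.0], gamma=0.5)
print(loss)  # target = 1 + 0.5 * 2 = 2, prediction = 1, loss = 1.0
```

Only the Q-value of the action actually taken enters the loss; the other outputs of the network are left alone for that transition.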
instability (training doesn't reliably converge to a good policy) in deep Q learning is caused by:
stabilization techniques: