Reinforcement Learning

May 11, 2026·4 min read

RL is a framework for solving decision problems by building agents that learn from the environment: they interact with it through trial and error and receive rewards (positive or negative) as their only feedback

at each timestep t

agent receives state S_t from the environment

agent takes action A_t

environment transitions to the next state S_t+1

environment provides a reward R_t+1

this forms a loop: (s_t, a_t, r_t+1, s_t+1)
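a minimal sketch of this loop in Python. The `Corridor` environment, its reward values, and the `run_episode` helper are all hypothetical, made up here only to show the (state, action, reward, next state) cycle:

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, the goal is position 4."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial state S_0

    def step(self, action):
        # action 0 = move left, action 1 = move right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, +1 at the goal
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the agent-environment loop and record (s_t, a_t, r_t+1, s_t+1)."""
    transitions = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks A_t
        next_state, reward, done = env.step(action)  # env returns S_t+1, R_t+1
        transitions.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return transitions


def random_policy(state):
    return random.choice([0, 1])


episode = run_episode(Corridor(), random_policy)
```

here the agent acts randomly, which is the "trial and error" part before any learning happens.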

Reward hypothesis: every goal can be described as maximizing the expected cumulative reward (aka expected return), so that is what the agent tries to maximize

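Markov property (the assumption behind a Markov decision process, MDP): the agent only needs the current state to decide what action to take, not the history of all states and actions that came before

likewise, the next state depends only on the current state and action:

$P(s_{t+1} \mid s_t, a_t)$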

state S: a complete description of the state of the world

action space: all possible actions in an environment

discrete space: number of possible actions is finite

continuous space: number of possible actions is infinite

the cumulative return is the sum of all future rewards (no discounting yet):

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$

$G_t$ = cumulative return from timestep t

future rewards are uncertain, so we discount them

how to discount rewards?

apply a discount factor $\gamma \in [0, 1]$ to each reward; typically $\gamma$ is between 0.95 and 0.99

high $\gamma$ → agent cares about long-term rewards

low $\gamma$ → agent cares about short-term rewards

discounted return (the agent maximizes its expected value):

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
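a small sketch of this formula in Python. The function name `discounted_return` and the example reward list are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards = [R_{t+1}, R_{t+2}, ...] and discount gamma."""
    g = 0.0
    # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# a single reward of 10 that only arrives on the 4th step
delayed = [0.0, 0.0, 0.0, 10.0]

patient = discounted_return(delayed, gamma=0.99)   # about 9.70
impatient = discounted_return(delayed, gamma=0.5)  # 1.25
```

with a high discount factor the delayed reward keeps almost all of its value; with a low one it shrinks quickly, which is why a low-gamma agent prefers short-term rewards.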

task: instance of reinforcement learning problem. types:

the goal in reinforcement learning is to build an agent that selects actions which maximize the expected cumulative (discounted) reward. for this, the agent must learn an optimal policy

Policy $\pi$: brain/rules which tells agent what action to take when it is in some state

analogy: like a government policy, it says what action to take under certain conditions

two approaches to learn the optimal policy:

types of value functions:

a value function is a way to measure “how good it is to be in a state” or “how good it is to take a certain action in a state”
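as a sketch in standard textbook notation (the symbols $V_\pi$ and $Q_\pi$ are assumptions here, they are not defined elsewhere in these notes), the two kinds of "how good" can be written as expected returns when following policy $\pi$:

$V_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$

$Q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$

the first scores a state, the second scores a state-action pair.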

in both cases the policy is derived from the value function; it's not learned directly

some strategies to train our value/policy function

Q-learning: off-policy, value-based RL

goal is to find the optimal policy indirectly by learning the Q function

does a TD (temporal difference) update after each step

has a Q function Q(s, a): the quality of taking action a in state s (Q stands for quality)

Internally stored as a Q-table

this works well only for small discrete state spaces

not scalable to continuous or large state spaces

also, the Q value of every cell has to be calculated beforehand
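a minimal sketch of the tabular TD update, assuming the standard Q-learning rule Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', a') − Q(s, a)). The learning rate `alpha`, the `(state, action)` dictionary layout, and starting every cell at 0 are choices made for this example, not something the notes specify:

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Apply one TD update to the cell Q(state, action) and return its new value."""
    if done:
        best_next = 0.0  # no future reward after a terminal state
    else:
        # value of the best action available in the next state
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]


# the Q-table: one cell per (state, action) pair, all starting at 0 (assumption)
q_table = defaultdict(float)

new_q = q_learning_update(q_table, state=0, action=1, reward=1.0,
                          next_state=1, done=False, n_actions=2)
```

the table needs one cell per state-action pair, which is why this approach stops being practical once the state space gets large or continuous.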

off-policy: using one policy for acting (inference) and a different one for updating (training)

on-policy: using the same policy for acting and updating

acting policy: which action do I take in the real environment at this moment?

updating policy: when updating Q-values, which future action do I assume I'll take?

that future action could be an exploratory one or the optimal (greedy) one
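to make the difference concrete, here is a small sketch. Epsilon-greedy as the acting policy and SARSA as the on-policy counterpart are swapped in as common examples; neither is named in these notes, and all function names are made up:

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Acting policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma):
    """Off-policy update: assume the best (greedy) action is taken next."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy update: use the action the acting policy actually picked."""
    return reward + gamma * next_q_values[next_action]


next_q_values = [0.2, 1.0, 0.5]
next_action = 2  # pretend the acting policy made an exploratory pick

off_policy = q_learning_target(0.0, next_q_values, gamma=0.9)          # uses 1.0
on_policy = sarsa_target(0.0, next_q_values, next_action, gamma=0.9)   # uses 0.5
```

both agents can act with the same epsilon-greedy policy; only the action assumed inside the update differs.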

Deep Q learning (DQN)

use a deep neural network to approximate the Q values Q(s, a)

input: state

output: Q values for all possible actions
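a rough sketch of that input/output shape in plain Python, with no deep learning library. The layer sizes (4 state features, 8 hidden units, 2 actions), the ReLU activation, and the random untrained weights are all assumptions made for illustration:

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def q_network(state, hidden_layer, output_layer):
    """Map a state vector to one Q-value per action."""
    hidden = [max(0.0, h) for h in linear(state, hidden_layer)]  # ReLU
    return linear(hidden, output_layer)


rng = random.Random(0)
hidden_layer = init_layer(4, 8, rng)   # 4 state features -> 8 hidden units
output_layer = init_layer(8, 2, rng)   # 8 hidden units -> 2 actions

q_values = q_network([0.1, -0.2, 0.0, 0.3], hidden_layer, output_layer)
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
```

a single forward pass gives the Q-values of every action at once, so picking the greedy action is just an argmax over the output.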

loss function in DQN

$\mathcal{L} = \left(Q_{\text{target}} - Q_\theta(s, a)\right)^2$
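a sketch of this loss for a single transition. The notes do not define $Q_{\text{target}}$, so this assumes the usual TD target: reward plus the discounted best Q-value in the next state, or just the reward if the episode ended. The function name and the numbers are made up:

```python
def dqn_loss(q_pred, reward, next_q_values, done, gamma=0.99):
    """Squared TD error for one transition.

    q_pred        -- Q_theta(s, a), the network's estimate for the action taken
    next_q_values -- the network's Q-values for every action in the next state
    """
    if done:
        q_target = reward  # nothing to bootstrap from after a terminal state
    else:
        q_target = reward + gamma * max(next_q_values)
    return (q_target - q_pred) ** 2


# target = 0.5 + 0.9 * 2.0 = 2.3, so the loss is about (2.3 - 1.0)^2 = 1.69
loss = dqn_loss(q_pred=1.0, reward=0.5, next_q_values=[1.0, 2.0],
                done=False, gamma=0.9)
```

in practice this is averaged over a batch of transitions and minimized by gradient descent on the network parameters.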

instability (training doesn't reliably converge to a good policy) in deep Q-learning is caused by:

stabilization techniques: