Foundation of LLMs
ChatML
<|im_start|> — Start a message block
<|im_end|> — End a message block
<|action_start|> — Start an external tool invocation
<|action_end|> — End the tool block
<|interpreter|> — Token indicating the interpreter tool
<|plugin|> — Token marking plugins/tools
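A minimal sketch of how the two message-block tokens wrap a conversation. It covers only <|im_start|> and <|im_end|>; the tool tokens are left out because their exact layout varies by model. The function names and the message dict shape are assumptions for illustration, not a specific library's API.

```python
def format_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = []
    for msg in messages:
        # Each message: start token, role, newline, content, end token.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)


def with_generation_prompt(messages):
    """Append an open assistant block so the model continues from there."""
    return format_chatml(messages) + "<|im_start|>assistant\n"


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is ChatML?"},
]
prompt = with_generation_prompt(messages)
print(prompt)
```

The model is expected to write its reply after the open assistant block and close it with <|im_end|>, which doubles as the stop signal.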
SFT (Supervised fine-tuning)
SFT doesn’t teach new facts - it teaches new behaviors. The model already knows about the world from pre-training; SFT teaches it how to be a helpful assistant using that knowledge.
Gradient accumulation is a technique where you add up gradients from several small mini-batches before updating the model's weights. This approximates training with a bigger batch, while only one mini-batch's activations need to fit in memory at a time.
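A toy sketch of gradient accumulation, assuming a one-parameter model y_hat = w * x with mean squared error (the model, data, and learning rate are made up for illustration). It shows that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient, so a single update after accumulation matches one large-batch update.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr = 0.0, 0.01

# Reference: gradient computed on the full batch of 4 examples.
full_grad = grad_mse(w, data)

# Same gradient built from 2 micro-batches of 2 examples each.
accum_steps = 2
micro_batches = [data[0:2], data[2:4]]
accum_grad = 0.0
for micro in micro_batches:
    # Divide by accum_steps so the accumulated sum is an average.
    accum_grad += grad_mse(w, micro) / accum_steps

# Weights are updated once, only after all micro-batches are processed.
w_new = w - lr * accum_grad
print(full_grad, accum_grad, w_new)
```

The equality holds exactly only when the micro-batches are the same size and the loss is a mean over examples; layers that depend on batch statistics (such as batch norm) still see the small micro-batch.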
Parameter-efficient fine-tuning (PEFT)
An umbrella term for techniques that adapt a pre-trained model to a specific task by tuning only a small number of parameters instead of the whole model. This makes tuning much faster, with far less compute, memory, and storage, while keeping most of the model's knowledge intact. Some techniques:
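LoRA (low-rank adaptation) is one widely used PEFT technique. These notes do not name it, so treat the choice as an assumption made for illustration. The sketch below freezes a weight matrix W and trains only two small matrices A and B whose product forms a low-rank update; the sizes d, k, r and the scale alpha are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


d, k, r, alpha = 8, 8, 2, 4  # hypothetical layer size, rank, and scale
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, zero init


def effective_weight(W, A, B, alpha, r):
    """W + (alpha / r) * B @ A, the weight the adapted layer actually uses."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


full_params = d * k        # parameters touched by full fine-tuning
lora_params = r * (d + k)  # parameters trained by this LoRA sketch
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pre-trained one, and the saving grows with layer size since r stays small while d and k grow.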
Preference alignment
The process of training or adapting an AI model so that its outputs match human preferences, values, and intentions.
Any-to-any models
These models have multiple encoders and fuse their embeddings to create a shared representation space. The decoders take this shared latent space as input and decode into the modality of choice.
GRPO (Group relative policy optimization)
Instead of learning a value function (critic) like PPO does, GRPO:
samples a group of responses for the same prompt
scores them using a reward model
computes relative advantages within the group
updates the policy to prefer better responses relative to the others
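A minimal sketch of the "relative advantages within the group" step, assuming the common formulation where each reward is normalized by the group's mean and standard deviation. The reward values are made up, and the use of the population standard deviation and the eps term are illustrative choices rather than a fixed part of the method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for 4 sampled responses to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean get a positive advantage and are reinforced, those below get a negative one, and the group mean plays the baseline role that the critic plays in PPO.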