6:["$","main",null,{"className":"canvas-post flex min-h-dvh max-h-dvh flex-col overflow-hidden","children":[["$","$L11",null,{"title":"Foundation of LLMs","publishedAt":"2026-05-11T18:41:30.323Z","readingMinutes":3,"pages":[{"page":{"id":"pg_d85ab9f22a9d","excerpt":"SMOL","canvas":{"bounds":{"width":800,"height":2420},"elements":[{"type":"textBox","id":"tb_e2c52f1da355","x":80,"y":80,"w":640,"z":0,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"chatML"}],"attrs":"$undefined"}]},"h":24},{"type":"textBox","id":"tb_00caec217686","x":80,"y":128,"w":640,"z":1,"content":{"type":"doc","blocks":[{"type":"codeBlock","content":[{"type":"text","text":"<|im_start|> — Start a message block\n<|im_end|> — End a message block\n\n<|action_start|> — Start an external tool invocation\n<|action_end|> — End the tool block\n<|interpreter|> — Token indicating the interpreter tool\n<|plugin|> — Token marking plugins/tools\n\n"}],"attrs":{"language":"python"}}]},"h":112},{"type":"textBox","id":"tb_439b5d322838","x":80,"y":264,"w":640,"z":2,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"SFT"}],"attrs":{"level":3}}]},"h":36},{"type":"textBox","id":"tb_e3aff2acbd39","x":80,"y":324,"w":640,"z":3,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"SFT doesn’t teach new facts - it teaches new behaviors. 
The model already knows about the world from pre-training; SFT teaches it how to be a helpful assistant using that knowledge."}],"attrs":"$undefined"}]},"h":72},{"type":"textBox","id":"tb_d98aad74ab2c","x":80,"y":420,"w":640,"z":4,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"Gradient accumulation","marks":{"bold":true}},{"type":"text","text":" is a technique where you add up gradients from several small mini‑batches before updating the model’s weights, which emulates training with a larger batch without the extra memory a larger batch would need."}],"attrs":"$undefined"}]},"h":72},{"type":"textBox","id":"tb_2d2abecbc3cb","x":80,"y":516,"w":640,"z":5,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Parameter-efficient fine-tuning (PEFT)"}],"attrs":{"level":2}}]},"h":36},{"type":"textBox","id":"tb_4d84131dbcc6","x":80,"y":576,"w":640,"z":6,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"umbrella term for a set of techniques that adapt a pre-trained model to a specific task by tuning only a small fraction of its parameters, which makes fine-tuning much faster, with far less compute, memory and storage, while keeping most of the model’s knowledge intact. some techniques:"}],"attrs":"$undefined"}]},"h":96},{"type":"textBox","id":"tb_05dc015d4b80","x":80,"y":696,"w":640,"z":7,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"LoRA (low-rank adaptation): instead of updating the full weight matrix W of a layer during fine-tuning, LoRA approximates the update with the product of two smaller matrices, typically called A and B"},{"type":"text","text":"Traditional fine‑tuning:\nW' = W + ΔW\n\nLoRA fine‑tuning:\nW' = W + B·A\n"},{"type":"text","text":"for a weight matrix W of shape (d×k), LoRA introduces two small trainable matrices: B of shape (d×r) and A of shape (r×k). r is the rank hyperparameter and is chosen to be much smaller than d and k (e.g. 
r=4, 8, 16)"},{"type":"text","text":"the product B·A has the same shape as W but needs far fewer trainable params: r·(d + k) instead of d·k"},{"type":"text","text":"after fine-tuning, the learned update B·A can be merged back into the original weights"},{"type":"text","text":"…"},{"type":"text","text":"…"}],"attrs":"$undefined"}]},"h":216},{"type":"textBox","id":"tb_2ab515f469e8","x":80,"y":936,"w":640,"z":8,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Preference alignment"}],"attrs":{"level":2}}]},"h":36},{"type":"textBox","id":"tb_a1f2847d6a3c","x":80,"y":996,"w":640,"z":9,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"process of training/adapting an AI model so its outputs match human preferences, values and intentions"}],"attrs":"$undefined"}]},"h":48},{"type":"textBox","id":"tb_d632654b4308","x":80,"y":1068,"w":640,"z":10,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"RLHF: human annotators rank responses, a reward model is trained from those rankings, and the language model is fine-tuned to maximize that reward"},{"type":"text","text":"you start with a pretrained model"},{"type":"text","text":"human labellers write/select high-quality responses (model outputs) to prompts (model inputs)"},{"type":"text","text":"human labellers rank/score candidate responses; you then train a reward model that tries to predict the scores given by humans. 
once the reward model is trained, it assigns higher scores to outputs that humans liked more"},{"type":"text","text":"the LLM is then fine-tuned with an RL algorithm, typically PPO (proximal policy optimization), to maximize the reward model score"},{"type":"text","text":"as a result, the LLM’s outputs align more closely with what humans prefer"},{"type":"text","text":"DPO: instead of training a separate reward model, you give the model pairs of responses where humans marked one as preferred; DPO then trains the model to make the preferred output more likely and the dispreferred one less likely on similar prompts"},{"type":"text","text":"for each prompt, you have two responses: y_w (winner) and y_l (loser)"},{"type":"text","text":"for each pair, you compute the log-probabilities of both responses under the model being trained and under a frozen reference model"},{"type":"text","text":"use a preference loss that encourages the model to prefer the winning response"},{"type":"text","text":"Suppose for prompt:\n“What’s the capital of France?”\n\nYou have:\nModel response A: “Paris” (preferred)\nModel response B: “Marseille” (dispreferred)\n\nDPO will update the model so that:\nP(Paris | prompt) increases\nP(Marseille | prompt) decreases\n"},{"type":"text","text":"ORPO (odds ratio preference optimization)"},{"type":"text","text":"what usually happens is you first perform SFT on a relevant dataset of good responses, then apply DPO using human preference data to align the model with human preferences"},{"type":"text","text":"people apply SFT first because it teaches the model basic instruction-following and output patterns; that gives DPO a better foundation, so preference optimization focuses on nuance rather than teaching the model basic behaviours"},{"type":"text","text":"ORPO integrates alignment directly into the training loss using an odds-ratio term. 
It uses preference data directly during SFT, with a loss term that reinforces good responses and penalizes bad ones simultaneously: ORPO loss = SFT loss + λ × odds-ratio loss"}],"attrs":"$undefined"}]},"h":672},{"type":"textBox","id":"tb_7d2542e325e3","x":80,"y":1764,"w":640,"z":11,"content":{"type":"doc","blocks":[{"type":"horizontalRule","content":[]}]},"h":24},{"type":"textBox","id":"tb_2badfa9e9aff","x":80,"y":1812,"w":640,"z":12,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Any-to-Any models"}],"attrs":{"level":3}}]},"h":36},{"type":"textBox","id":"tb_bc143fffe1bb","x":80,"y":1872,"w":640,"z":13,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"these models have multiple encoders whose embeddings are fused into a shared representation space. The decoders take this shared latent space as input and decode into the modality of choice"}],"attrs":"$undefined"}]},"h":72},{"type":"textBox","id":"tb_3d51e3e93997","x":80,"y":1968,"w":640,"z":14,"content":{"type":"doc","blocks":[{"type":"horizontalRule","content":[]}]},"h":24},{"type":"textBox","id":"tb_3da9f5019f6a","x":80,"y":2016,"w":640,"z":15,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"GRPO (Group relative policy optimization)"}],"attrs":{"level":3}}]},"h":36},{"type":"textBox","id":"tb_511633e4a145","x":80,"y":2076,"w":640,"z":16,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"instead of learning a value function (critic) like PPO does, GRPO samples a group of responses for the same prompt, scores them using a reward model, computes relative advantages within the group, and updates the policy to prefer the better responses relative to the others"}],"attrs":"$undefined"}]},"h":96},{"type":"textBox","id":"tb_ffd150d88767","x":80,"y":2196,"w":640,"z":17,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"sample a 
prompt"},{"type":"text","text":"generate N responses using the current policy"},{"type":"text","text":"score each response with a reward model"},{"type":"text","text":"compute each response’s relative advantage within the group"},{"type":"text","text":"update the policy using these relative advantages"},{"type":"text","text":"apply a KL penalty to keep the model close to the base policy"}],"attrs":"$undefined"}]},"h":96},{"type":"textBox","id":"tb_2c4618afe3e3","x":80,"y":2316,"w":640,"z":18,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[],"attrs":"$undefined"}]},"h":24}],"ink":{"strokes":[]}}},"canvas":{"bounds":{"width":800,"height":2420},"elements":[{"type":"textBox","id":"tb_e2c52f1da355","x":80,"y":80,"w":640,"h":24,"z":0,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"chatML"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_00caec217686","x":80,"y":128,"w":640,"h":112,"z":1,"content":{"type":"doc","blocks":[{"type":"codeBlock","content":[{"type":"text","text":"<|im_start|> — Start a message block\n<|im_end|> — End a message block\n\n<|action_start|> — Start an external tool invocation\n<|action_end|> — End the tool block\n<|interpreter|> — Token indicating the interpreter tool\n<|plugin|> — Token marking plugins/tools\n\n"}],"attrs":{"language":"python"}}]}},{"type":"textBox","id":"tb_439b5d322838","x":80,"y":264,"w":640,"h":36,"z":2,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"SFT"}],"attrs":{"level":3}}]}},{"type":"textBox","id":"tb_e3aff2acbd39","x":80,"y":324,"w":640,"h":72,"z":3,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"SFT doesn’t teach new facts - it teaches new behaviors. 
The model already knows about the world from pre-training; SFT teaches it how to be a helpful assistant using that knowledge."}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_d98aad74ab2c","x":80,"y":420,"w":640,"h":72,"z":4,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"Gradient accumulation","marks":{"bold":true}},{"type":"text","text":" is a technique where you add up gradients from several small mini‑batches before updating the model’s weights, which emulates training with a larger batch without the extra memory a larger batch would need."}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_2d2abecbc3cb","x":80,"y":516,"w":640,"h":36,"z":5,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Parameter-efficient fine-tuning (PEFT)"}],"attrs":{"level":2}}]}},{"type":"textBox","id":"tb_4d84131dbcc6","x":80,"y":576,"w":640,"h":96,"z":6,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"umbrella term for a set of techniques that adapt a pre-trained model to a specific task by tuning only a small fraction of its parameters, which makes fine-tuning much faster, with far less compute, memory and storage, while keeping most of the model’s knowledge intact. some techniques:"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_05dc015d4b80","x":80,"y":696,"w":640,"h":216,"z":7,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"LoRA (low-rank adaptation): instead of updating the full weight matrix W of a layer during fine-tuning, LoRA approximates the update with the product of two smaller matrices, typically called A and B"},{"type":"text","text":"Traditional fine‑tuning:\nW' = W + ΔW\n\nLoRA fine‑tuning:\nW' = W + B·A\n"},{"type":"text","text":"for a weight matrix W of shape (d×k), LoRA introduces two small trainable matrices: B of shape (d×r) and A of shape (r×k). r is the rank hyperparameter and is chosen to be much smaller than d and k (e.g. 
r=4, 8, 16)"},{"type":"text","text":"the product B·A has the same shape as W but needs far fewer trainable params: r·(d + k) instead of d·k"},{"type":"text","text":"after fine-tuning, the learned update B·A can be merged back into the original weights"},{"type":"text","text":"…"},{"type":"text","text":"…"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_2ab515f469e8","x":80,"y":936,"w":640,"h":36,"z":8,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Preference alignment"}],"attrs":{"level":2}}]}},{"type":"textBox","id":"tb_a1f2847d6a3c","x":80,"y":996,"w":640,"h":48,"z":9,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"process of training/adapting an AI model so its outputs match human preferences, values and intentions"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_d632654b4308","x":80,"y":1068,"w":640,"h":672,"z":10,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"RLHF: human annotators rank responses, a reward model is trained from those rankings, and the language model is fine-tuned to maximize that reward"},{"type":"text","text":"you start with a pretrained model"},{"type":"text","text":"human labellers write/select high-quality responses (model outputs) to prompts (model inputs)"},{"type":"text","text":"human labellers rank/score candidate responses; you then train a reward model that tries to predict the scores given by humans. 
once the reward model is trained, it assigns higher scores to outputs that humans liked more"},{"type":"text","text":"the LLM is then fine-tuned with an RL algorithm, typically PPO (proximal policy optimization), to maximize the reward model score"},{"type":"text","text":"as a result, the LLM’s outputs align more closely with what humans prefer"},{"type":"text","text":"DPO: instead of training a separate reward model, you give the model pairs of responses where humans marked one as preferred; DPO then trains the model to make the preferred output more likely and the dispreferred one less likely on similar prompts"},{"type":"text","text":"for each prompt, you have two responses: y_w (winner) and y_l (loser)"},{"type":"text","text":"for each pair, you compute the log-probabilities of both responses under the model being trained and under a frozen reference model"},{"type":"text","text":"use a preference loss that encourages the model to prefer the winning response"},{"type":"text","text":"Suppose for prompt:\n“What’s the capital of France?”\n\nYou have:\nModel response A: “Paris” (preferred)\nModel response B: “Marseille” (dispreferred)\n\nDPO will update the model so that:\nP(Paris | prompt) increases\nP(Marseille | prompt) decreases\n"},{"type":"text","text":"ORPO (odds ratio preference optimization)"},{"type":"text","text":"what usually happens is you first perform SFT on a relevant dataset of good responses, then apply DPO using human preference data to align the model with human preferences"},{"type":"text","text":"people apply SFT first because it teaches the model basic instruction-following and output patterns; that gives DPO a better foundation, so preference optimization focuses on nuance rather than teaching the model basic behaviours"},{"type":"text","text":"ORPO integrates alignment directly into the training loss using an odds-ratio term. 
It uses preference data directly during SFT, with a loss term that reinforces good responses and penalizes bad ones simultaneously: ORPO loss = SFT loss + λ × odds-ratio loss"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_7d2542e325e3","x":80,"y":1764,"w":640,"h":24,"z":11,"content":{"type":"doc","blocks":[{"type":"horizontalRule","content":[],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_2badfa9e9aff","x":80,"y":1812,"w":640,"h":36,"z":12,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"Any-to-Any models"}],"attrs":{"level":3}}]}},{"type":"textBox","id":"tb_bc143fffe1bb","x":80,"y":1872,"w":640,"h":72,"z":13,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"these models have multiple encoders whose embeddings are fused into a shared representation space. The decoders take this shared latent space as input and decode into the modality of choice"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_3d51e3e93997","x":80,"y":1968,"w":640,"h":24,"z":14,"content":{"type":"doc","blocks":[{"type":"horizontalRule","content":[],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_3da9f5019f6a","x":80,"y":2016,"w":640,"h":36,"z":15,"content":{"type":"doc","blocks":[{"type":"heading","content":[{"type":"text","text":"GRPO (Group relative policy optimization)"}],"attrs":{"level":3}}]}},{"type":"textBox","id":"tb_511633e4a145","x":80,"y":2076,"w":640,"h":96,"z":16,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[{"type":"text","text":"instead of learning a value function (critic) like PPO does, GRPO samples a group of responses for the same prompt, scores them using a reward model, computes relative advantages within the group, and updates the policy to prefer the better responses relative to the 
others"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_ffd150d88767","x":80,"y":2196,"w":640,"h":96,"z":17,"content":{"type":"doc","blocks":[{"type":"orderedList","content":[{"type":"text","text":"sample a prompt"},{"type":"text","text":"generate N responses using the current policy"},{"type":"text","text":"score each response with a reward model"},{"type":"text","text":"compute each response’s relative advantage within the group"},{"type":"text","text":"update the policy using these relative advantages"},{"type":"text","text":"apply a KL penalty to keep the model close to the base policy"}],"attrs":"$undefined"}]}},{"type":"textBox","id":"tb_2c4618afe3e3","x":80,"y":2316,"w":640,"h":24,"z":18,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[],"attrs":"$undefined"}]}}],"ink":"$6:props:children:0:props:pages:0:page:canvas:ink"},"degraded":false},{"page":{"id":"pg_073e7a8fd458","excerpt":"","canvas":{"bounds":{"width":800,"height":1600},"elements":[{"type":"textBox","id":"tb_81d4eb3ffd37","x":80,"y":80,"w":640,"z":0,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[],"attrs":"$undefined"}]},"h":24}],"ink":{"strokes":[]}}},"canvas":{"bounds":{"width":800,"height":1600},"elements":[{"type":"textBox","id":"tb_81d4eb3ffd37","x":80,"y":80,"w":640,"h":24,"z":0,"content":{"type":"doc","blocks":[{"type":"paragraph","content":[],"attrs":"$undefined"}]}}],"ink":"$6:props:children:0:props:pages:1:page:canvas:ink"},"degraded":false}],"canvasDegraded":false,"hideHeader":false}],"$L12"]}]