LLM Optimizations

Tiling technique

1. Instead of materializing the entire QK^T matrix in memory, FlashAttention (FA) computes it tile by tile:
   Q is split into blocks of 128 queries
   K and V are split into blocks of 128 keys/values
2. Load a tile of keys and values into shared memory, then:
   compute partial attention scores on the fly (online softmax)
   multiply them by the corresponding value tile
   accumulate the partial results
3. Move on to the next key-value tile, reuse the same shared memory, and repeat.
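
The three steps above can be sketched in plain Python. This is only an illustrative, single-threaded model of the arithmetic, not the FlashAttention kernel itself: the function names, the `block` parameter, and the per-query loop are assumptions made for readability, and "shared memory" is represented simply by the `k_tile` / `v_tile` slices. The point it shows is that a running max, a running denominator, and a running output are enough to get the exact softmax result without ever holding a full row of QK^T.

```python
import math

def naive_attention(Q, K, V):
    """Reference: softmax(q.K^T / sqrt(d)) V, materializing every full score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=128):
    """Same result, but K/V are visited one tile of `block` rows at a time."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:  # a real kernel processes a whole block of queries together
        m = float("-inf")   # running max of the scores seen so far
        l = 0.0             # running softmax denominator
        acc = [0.0] * dv    # running (unnormalized) output row
        for start in range(0, len(K), block):
            # Step 2: "load" one tile of keys and values.
            k_tile = K[start:start + block]
            v_tile = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_tile]
            # Online softmax: rescale old statistics to the new running max.
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)   # equals 0.0 on the first tile
            p = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(p)
            # Multiply by the value tile and accumulate the partial result.
            acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, v_tile))
                   for j, a in enumerate(acc)]
            m = m_new
            # Step 3: the loop moves on and the tile buffers are reused.
        out.append([a / l for a in acc])
    return out
```

Whatever block size is chosen, `tiled_attention` should agree with `naive_attention` up to floating-point rounding; only the amount of data held at once changes.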

"The sky is blue."  → 4 tokens
"Hello"             → 1 token
"A long paragraph"  → 3 tokens

Instead of padding each one to the same length, sequence packing concatenates them into a single sequence.
Packed sequence: [The, sky, is, blue, Hello, A, long, paragraph]
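
A minimal sketch of the packing step for the three example sentences. The helper names and the `cu_seqlens` (cumulative sequence lengths) bookkeeping are assumptions introduced for illustration. The block-diagonal mask is also an addition not stated in the notes: it reflects the common practice of preventing tokens from attending across the boundaries of the original sequences.

```python
def pack_sequences(sequences):
    """Concatenate token lists and record where each original sequence starts/ends."""
    packed = []
    cu_seqlens = [0]  # cumulative lengths: sequence i spans cu_seqlens[i]:cu_seqlens[i+1]
    for seq in sequences:
        packed.extend(seq)
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens

def block_diagonal_mask(cu_seqlens):
    """mask[r][c] is True only if positions r and c came from the same sequence."""
    total = cu_seqlens[-1]
    seq_id = [0] * total
    for i, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for pos in range(start, end):
            seq_id[pos] = i
    return [[seq_id[r] == seq_id[c] for c in range(total)] for r in range(total)]

sequences = [
    ["The", "sky", "is", "blue"],
    ["Hello"],
    ["A", "long", "paragraph"],
]
packed, cu_seqlens = pack_sequences(sequences)
mask = block_diagonal_mask(cu_seqlens)
```

Here `packed` holds the 8 tokens in order, and `cu_seqlens` marks the boundaries after 4, 5, and 8 tokens, which is enough to recover each original sequence from the packed one.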

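Each element is a chunk (vector) of gradients; the GPUs form a ring A → B → C → D → A, and each GPU sends one chunk to its neighbour per iteration.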
GPUA: [a1, a2, a3, a4]
GPUB: [b1, b2, b3, b4]
GPUC: [c1, c2, c3, c4]
GPUD: [d1, d2, d3, d4]

after 1 iteration
GPUA: [a1, a2, a3, a4 + d4]
GPUB: [a1 + b1, b2, b3, b4]
GPUC: [c1, c2 + b2, c3, c4]
GPUD: [d1, d2, d3 + c3, d4]

after 3 iterations (each GPU holds one fully reduced chunk; - marks a chunk that is only partially reduced)

GPUA: [        -        , a2 + b2 + c2 + d2,         -        ,         -        ]
GPUB: [        -        ,         -        , a3 + b3 + c3 + d3,         -        ]
GPUC: [        -        ,         -        ,         -        , a4 + b4 + c4 + d4]
GPUD: [a1 + b1 + c1 + d1,         -        ,         -        ,         -        ]
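
The exchange pattern can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: each "GPU" is a plain list, each chunk is a single number rather than a gradient vector, the number of chunks equals the number of GPUs, and the function names are made up for illustration. It runs the reduce-scatter phase (N-1 steps, after which GPU i owns the fully reduced chunk (i+1) mod N) followed by an all-gather phase (another N-1 steps) that circulates the finished chunks so every GPU ends with the full sum.

```python
def reduce_scatter_step(data, step):
    """One ring iteration: GPU i sends chunk (i - step) % n to GPU i + 1, which adds it."""
    n = len(data)
    # Snapshot every outgoing message first, so all sends happen "simultaneously".
    sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
    for src, (chunk, value) in enumerate(sends):
        data[(src + 1) % n][chunk] += value

def all_gather_step(data, step):
    """One ring iteration: GPU i forwards a finished chunk, the receiver overwrites its copy."""
    n = len(data)
    sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
    for src, (chunk, value) in enumerate(sends):
        data[(src + 1) % n][chunk] = value

def ring_all_reduce(buffers):
    """Return a new list of buffers in which every GPU holds the element-wise sum."""
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so the inputs are left untouched
    for step in range(n - 1):          # phase 1: reduce-scatter
        reduce_scatter_step(data, step)
    for step in range(n - 1):          # phase 2: all-gather
        all_gather_step(data, step)
    return data

gpus = [
    [1, 2, 3, 4],              # GPUA: a1..a4
    [10, 20, 30, 40],          # GPUB: b1..b4
    [100, 200, 300, 400],      # GPUC: c1..c4
    [1000, 2000, 3000, 4000],  # GPUD: d1..d4
]
result = ring_all_reduce(gpus)
```

Calling `reduce_scatter_step` once on a copy of `gpus` reproduces the "after 1 iteration" state: GPUA's last slot becomes a4 + d4, GPUB's first slot becomes a1 + b1, and so on. Each GPU sends and receives only one chunk per step, so the per-GPU traffic stays roughly constant as the ring grows.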

LLM Optimizations

Flash Attention

Tiling technique

Online softmax

Results

GPU memory Architecture

Multi Query Attention

Grouped Query Attention

Activation checkpointing

Sequence packing

Inference optimization technique

KV caching

KV caching optimizations

Stateful caching

Speculative decoding

Quantization techniques

Quantization Types

Post-training quantization

Mixed-precision quantization

Quantization-aware training

Training optimization

Data parallelism

Synchronization approaches

Bulk Synchronous Parallel (BSP)

Asynchronous Parallel (ASP)

Distributed Data Parallel (DDP)

Ring all-reduce algorithm