Resources
Lecture 1: Overview and Tokenization
Take-aways
- Mechanics: how things work (what a Transformer is, how model parallelism leverages GPUs)
- Mindset: squeezing the most out of the hardware, taking scale seriously (scaling laws)
- Intuitions: which data and modeling decisions yield good accuracy
- Some design decisions are simply not (yet) justifiable and just come from experimentation.
- Example: Noam Shazeer paper that introduced SwiGLU [Shazeer 2020]
The bitter lesson
- Wrong interpretation: scale is all that matters, algorithms don't matter.
- Right interpretation: algorithms that scale are what matter.
Levels of openness
- Closed models (e.g., GPT-4o): API access only [OpenAI+ 2023]
- Open-weight models (e.g., DeepSeek): weights available, paper with architecture details, some training details, no data details [DeepSeek-AI+ 2024]
- Open-source models (e.g., OLMo): weights and data available, paper with most details (but not necessarily the rationale, failed experiments) [Groeneveld+ 2024]
Efficiency drives design decisions
Today, we are compute-constrained, so design decisions will reflect squeezing the most out of given hardware.
- Data processing: avoid wasting precious compute updating on bad / irrelevant data
- Tokenization: working with raw bytes is elegant, but compute-inefficient with today's model architectures.
- Model architecture: many changes motivated by reducing memory or FLOPs (e.g., sharing KV caches, sliding window attention)
- Training: we can get away with a single epoch!
- Scaling laws: use less compute on smaller models to do hyperparameter tuning
- Alignment: the more the model is tuned to the desired use cases, the smaller the base model can be
Tomorrow, we will become data-constrained...
Byte Pair Encoding - BPE Tokenization
The BPE algorithm was introduced by Philip Gage in 1994 for data compression. [article]
It was adapted to NLP for neural machine translation. [Sennrich+ 2015]
(Previously, papers had been using word-based tokenization.)
BPE was then used by GPT-2. [Radford+ 2019]
Basic idea: train the tokenizer on raw text to automatically determine the vocabulary.
Intuition: common sequences of characters are represented by a single token, rare sequences are represented by many tokens.
from collections import defaultdict

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:  # @inspect string, @inspect num_merges
    # Start with the list of bytes of string.
    indices = list(map(int, string.encode("utf-8")))  # @inspect indices
    merges: dict[tuple[int, int], int] = {}  # (index1, index2) => merged index
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    for i in range(num_merges):
        # Count the number of occurrences of each pair of tokens
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        # Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        # Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)
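The function above depends on merge and BPETokenizerParams from the course's helper code. A minimal sketch of what they could look like, with the behavior inferred from how they are used above:

from dataclasses import dataclass

@dataclass
class BPETokenizerParams:
    vocab: dict[int, bytes]              # index -> bytes
    merges: dict[tuple[int, int], int]   # (index1, index2) -> merged index

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    # Replace every occurrence of `pair` in `indices` with `new_index`.
    new_indices = []
    i = 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices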
Assignment 1: Basics
Lecture 5: GPUs
Comparison between CPUs and GPUs
- CPUs optimize for latency, where each thread finishes quickly
- GPUs optimize for throughput (total processed data per unit time)
Anatomy of GPUs
Execution Units
- SM (streaming multiprocessors): independently execute “blocks” (jobs), e.g., GA100 has 128 SMs
- Each SM contains many SPs (streaming processors) that can execute “threads” in parallel
Memory
The closer the memory to the SM, the faster it is
- Shared memory and L1 are inside the SM
- L2 cache is on die
- Global memory consists of memory chips next to the GPU (off-chip)
Execution model of GPUs
- Threads: threads do the same work in parallel but on different inputs (SIMT)
- Blocks: a block is a group of threads that runs on an SM with its own shared memory
- Warps: threads always execute in warps of 32 consecutively numbered threads
Memory model of GPUs
Hierarchy (from fastest to slowest)
- Register
- fastest storage on GPU
- Every thread has its own register file
- latency: one or a few cycles
- Size: 64K 32-bit registers (256 KB) per SM, shared by all threads within the SM
- Shared memory (SMEM)
- Every SM has a fast, on-chip SRAM that can be used as L1 cache and shared memory.
- can be manually allocated by developers
- All threads in a CUDA block can share shared memory
- L1 cache
- auto-managed cache by hardware
- latency: 10-100 cycles
- size: 192KB in A100
- Programming model: shared memory / L1 cache
- Physical implementation: SRAM = Static Random Access Memory, on-chip
- L2 cache
- shared across all SMs
- connects all SMs to global memory
- latency: ~200 cycles
- size: V100 6MB → A100 40 MB
- Global memory
- latency: >1000 cycles
- size: V100 32GB → A100 40/80GB → H100 80/94GB → H200 141GB
- Physical implementation: DRAM = Dynamic Random Access Memory, off-chip
TPUs
- TPUs are specially optimized for matrix multiplication via the MXU (matrix multiply unit)
- TensorCore: similar to SM in GPU
Scaling
- Compute scaling is faster than memory scaling
Reduce memory access
- Compute and store values in low precision, e.g., FP32 → FP16/BF16
- Kernel / operator fusion (e.g., torch.compile); see the sketch after this list
- Recomputation when doing back-propagation
- Memory coalescing and DRAM
- DRAM (global memory) is read in “burst mode”, where each read gives many bytes
- Accesses coalesce when threads read a row-major matrix along its rows (consecutive addresses)
- Tiling
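Below is a minimal sketch of kernel/operator fusion with torch.compile: the uncompiled function launches one kernel per elementwise op, each round-tripping through global memory, while the compiled version can fuse the whole chain into a single kernel. The function name and shapes are made up for illustration.

import torch

def gelu_like(x: torch.Tensor) -> torch.Tensor:
    # A chain of elementwise ops; eager mode runs each as a separate kernel
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

compiled = torch.compile(gelu_like)  # TorchInductor can fuse the chain into one kernel

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
y = compiled(x)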
FlashAttention
Online Softmax
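A minimal sketch of the online (streaming) softmax idea that FlashAttention builds on: process the scores block by block, maintaining a running maximum and normalizer so the full vector never has to be materialized at once. This is a generic illustration, not the lecture's code.

import torch

def online_softmax(scores: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    # Process the 1-D score vector block by block, keeping a running max m and normalizer l.
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    blocks = []
    for start in range(0, scores.numel(), block_size):
        block = scores[start:start + block_size]
        m_new = torch.maximum(m, block.max())
        # Rescale the old normalizer to the new max, then add this block's contribution
        l = l * torch.exp(m - m_new) + torch.exp(block - m_new).sum()
        m = m_new
        blocks.append(block)
    # Second pass: exponentiate against the final max and divide by the final normalizer
    return torch.cat([torch.exp(b - m) for b in blocks]) / l

x = torch.randn(1000)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0), atol=1e-6)  # sanity check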
Lecture 6: Kernels, Triton
Profiling
torch.profiler.profile
from typing import Callable

import torch
from torch.profiler import ProfilerActivity

def profile(description: str, run: Callable, num_warmups: int = 1, with_stack: bool = False):
    # Warmup
    for _ in range(num_warmups):
        run()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # Wait for CUDA threads to finish (important!)

    # Run the code with the profiler
    with torch.profiler.profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            # Output stack trace for visualization
            with_stack=with_stack,
            # Needed to export stack trace for visualization
            experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)) as prof:
        run()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # Wait for CUDA threads to finish (important!)

    # Print out table
    table = prof.key_averages().table(sort_by="cuda_time_total", max_name_column_width=80, row_limit=10)
    #text(f"## {description}")
    #text(table, verbatim=True)

    # Write stack trace visualization
    if with_stack:
        text_path = f"var/stacks_{description}.txt"
        svg_path = f"var/stacks_{description}.svg"
        prof.export_stacks(text_path, "self_cuda_time_total")

    return table
def run_operation1(dim: int, operation: Callable) -> Callable:
    # Setup: create one random dim x dim matrix (get_device() is a course helper returning the available device)
    x = torch.randn(dim, dim, device=get_device())
    # Return a function to perform the operation
    return lambda: operation(x)

def run_operation2(dim: int, operation: Callable) -> Callable:
    # Setup: create two random dim x dim matrices
    x = torch.randn(dim, dim, device=get_device())
    y = torch.randn(dim, dim, device=get_device())
    # Return a function to perform the operation
    return lambda: operation(x, y)
add_function = lambda a, b: a + b
add_profile = profile("add", run_operation2(dim=2048, operation=add_function))

matmul_function = lambda a, b: a @ b
matmul_profile = profile("matmul", run_operation2(dim=2048, operation=matmul_function))
nsys profile
uv run nsys profile -o result python cs336_systems/benchmark.py
Kernels
Take GeLU as an example:
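A minimal Triton sketch of a fused elementwise GeLU kernel (tanh approximation, with tanh expressed through tl.sigmoid). The kernel name, block size, and wrapper are illustrative assumptions, not necessarily the lecture's exact implementation:

import torch
import triton
import triton.language as tl

@triton.jit
def gelu_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the last partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    # tanh-approximate GeLU; tanh(z) = 2*sigmoid(2z) - 1
    z = 0.79788456 * (x + 0.044715 * x * x * x)
    t = 2.0 * tl.sigmoid(2.0 * z) - 1.0
    y = 0.5 * x * (1.0 + t)
    tl.store(y_ptr + offsets, y, mask=mask)

def triton_gelu(x: torch.Tensor) -> torch.Tensor:
    assert x.is_cuda and x.is_contiguous()
    y = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    gelu_kernel[grid](x, y, n_elements, BLOCK_SIZE=1024)
    return y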
FlashAttention-2
Lecture 7: Parallelism 1
Collective Operations
Broadcast
All ranks receive data from a “root/source” rank.
Reduce
One rank receives the reduction of input values across ranks.
Gather
Root rank receives data from all ranks.
Scatter
Root rank distributes data to all ranks.
All Reduce
Each rank receives the reduction of input values across ranks.
All Gather
Each rank receives the aggregation of data from all ranks in the order of the ranks.
Reduce Scatter
Same operation as Reduce, except that the result is scattered in equal-sized blocks between ranks, each rank getting a chunk of data based on its rank index.
Reduce Scatter followed by All Gather is equivalent to All Reduce
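A minimal torch.distributed sketch of the equivalence above, assuming an NCCL backend with one GPU per rank; the script layout, port, and tensor sizes are illustrative, not the lecture's code:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def demo(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.arange(world_size, dtype=torch.float32, device="cuda") + rank

    # All-reduce in one shot
    all_reduced = x.clone()
    dist.all_reduce(all_reduced, op=dist.ReduceOp.SUM)

    # Reduce-scatter followed by all-gather gives the same result
    chunk = torch.empty(1, device="cuda")
    dist.reduce_scatter_tensor(chunk, x, op=dist.ReduceOp.SUM)  # each rank gets its chunk of the sum
    gathered = torch.empty(world_size, device="cuda")
    dist.all_gather_into_tensor(gathered, chunk)

    assert torch.allclose(all_reduced, gathered)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(demo, args=(world_size,), nprocs=world_size)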
Parallelism for training
Lecture 15: Alignment - SFT/RLHF
Transition from imitation to optimization
- Imitation (SFT)
- Fit the model to some reference distribution
- Pure generative modeling (language modeling) perspective
- Requires samples from reference policy
- Optimization (RLHF)
- Find a policy $\pi$ that maximizes the expected reward $\mathbb{E}_{x \sim \pi}[r(x)]$ for a given reward function $r$
- Maximize some reward function that we can measure
- LMs are policies, not models of some (language) distribution
Data
Pairwise feedback (preference pairs)
Off-policy / On-policy data
- Off-policy: the training relies on pre-collected data (or data produced by another policy), eliminating the need for real-time generation.
- it tells what the general landscape looks like
- like learning to play chess by watching a collection of recorded games
- highly sensitive to how well the pre-collected data aligns with the model's current capabilities
- On-policy: during training, the model actively generates its own data samples.
- it tells how to refine the model itself
- like learning to play chess with a coach beside you giving real-time feedback
- can be more computationally demanding and time-consuming
- offers a higher theoretical performance ceiling because it continuously explores and updates based on the model's current state
RLHF
Policy Gradient Optimization
- Reinforcement learning optimization objective: $\max_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$
- $\pi_\theta$ is the policy (model), with parameters $\theta$
- $J(\theta)$ is the expected return of the policy (over all possible trajectories)
- $\tau = (s_0, a_0, s_1, a_1, \dots)$, the trajectory, is a sequence of states and corresponding actions
- The next state is governed by a probability distribution $P(s_{t+1} \mid s_t, a_t)$
- Thus, the probability of a trajectory is given by $P(\tau \mid \theta) = \rho_0(s_0) \prod_t P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
- The total return for a trajectory is defined as $R(\tau) = \sum_t \gamma^t r_t$
- where $\gamma$ is the discount factor and $r_t$ is the reward received at time $t$
- we discount future rewards because they are never as valuable as immediate ones
- Since our goal here is to maximize the return, we use stochastic gradient ascent to update the policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
- $\nabla_\theta J(\theta)$ here is known as the policy gradient
- In practice, it’s infeasible to compute the gradient of the above objective:
- firstly, it’s impossible to traverse all possible trajectories to obtain an accurate gradient
- we can do Monte Carlo sampling by rolling out a subset of trajectories as an estimation (just like SGD)
- secondly, the long trajectory lengths make auto-differentiation very memory intensive (we are calculating the gradient for the whole trajectory)
- we can derive a practical expression for the policy gradient so we can calculate the gradient for each step (state + action), instead of the whole trajectory all at once
- Expand the full expectation: $\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau$
- Interchange the gradient and the integral: $= \int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau$
- Apply the log-derivative trick: $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$
- Return to expectation form: $= \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)]$
- Decompose $\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_t \log P(s_{t+1} \mid s_t, a_t) + \sum_t \log \pi_\theta(a_t \mid s_t)$.
This is the key step that decomposes the influence of each move.
Note the gradients of $\log \rho_0(s_0)$ and $\log P(s_{t+1} \mid s_t, a_t)$ vanish because they are not determined by $\theta$.
- Now we have the final formula for the policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$
- With Monte Carlo sampling, the policy gradient can be approximated as $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$
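A minimal PyTorch sketch of this Monte Carlo estimator (vanilla REINFORCE, no baseline); the tensor shapes and placeholder inputs are assumed for illustration:

import torch

def reinforce_loss(logps: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # logps:   (N, T) log pi_theta(a_t | s_t) for N sampled trajectories of length T
    # returns: (N,)   total return R(tau) of each trajectory
    # Minimizing this loss performs gradient ascent on the Monte Carlo policy-gradient estimate.
    return -(logps.sum(dim=1) * returns).mean()

# Usage sketch: logps would come from the policy with gradients attached,
# returns are treated as constants (no gradient).
logps = torch.randn(8, 16, requires_grad=True)   # stand-in for real log-probs
returns = torch.randn(8)
loss = reinforce_loss(logps, returns.detach())
loss.backward()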
Algorithm Characteristics
- Model-free: This method relies solely on your actual gameplay experience and does not require any prior knowledge of the opponent’s strategy
Challenges
- We must sample a large enough number of trajectories to obtain an accurate gradient estimate; otherwise, the variance can be high.
- We now attempt to credit every move with the entire trajectory’s result, so the evaluation becomes extremely unstable.
- This is not reasonable since a decision can only influence future outcomes
Reducing Variance
- In practice, when evaluating the current move, you only need to consider the “future rewards”. This concept is known as rewards-to-go: $\hat{R}_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$
- We can also subtract a baseline (typically a function of the state $s_t$, like the value function $V(s_t)$) from the rewards.
- The advantage reflects the amount by which the actual return exceeds the baseline (an average move).
- Now, even if the state is favorable, a poor move still receives a low relative advantage, which makes more sense.
- The idea is to decrease the variance of the estimator by subtracting a term that is correlated with it, without introducing bias.
- In general, the expectation of the score function is zero ($\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$), so we conclude that the baselined policy gradient is unbiased
- Introducing the Q-function, V-function, and advantage function
- $Q^\pi(s_t, a_t)$: captures the expected future return when taking action $a_t$ in state $s_t$.
- $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}[Q^\pi(s_t, a_t)]$: a baseline of the state value
- $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$: the advantage function. It tells you how much more likely a specific move in a given state is to improve your chances of winning compared to an average move.
Estimation of Advantage
We can sum the one-step TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ over several steps as an estimate of the advantage
- If we stop too early, the estimate is not accurate enough and introduces high bias
- If we accumulate too many steps, it leads to high variance
Generalized Advantage Estimation (GAE)
GAE balances this bias–variance trade-off with an exponentially weighted sum of TD errors: $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$
PPO (Proximal Policy Optimization)
- On-policy
- Reference Model: A unique element in PPO for large models, it prevents the Actor from straying too far from the original pre-trained distribution, thereby mitigating issues such as reward hacking.
Loss functions
- Policy Loss: $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$
This component ensures that we are fine-tuning our policy, rather than taking reckless steps that could disrupt the overall structure
- Value Function Loss: $L^{V}(\phi) = \mathbb{E}_t\left[(V_\phi(s_t) - \hat{R}_t)^2\right]$
This term fits the value function, so that for every state the estimated expected return is as close as possible to the actual return received.
- Entropy Loss (bonus): $L^{H}(\theta) = \mathbb{E}_t\left[\mathcal{H}[\pi_\theta(\cdot \mid s_t)]\right]$
This term encourages the policy to keep exploring and prevents it from becoming too deterministic.
The Token-per-Token Process for Training LLMs
State: current context (prompt + past generated tokens)
Action: generate a new token given the current context
- Sampling Phase
θ_old = PretrainedLLM.parameters

# Sampling Phase: Generate a batch of data using θ_old
for each prompt in dataset:
    trajectory = []
    state = prompt
    while not end_of_sequence:  # token-per-token generation loop
        token, logpi_old = θ_old.generate_token(state)
        # Record the current state, token, and the log-probability under θ_old
        trajectory.append( (state, token, logpi_old, reward, V(state)) )
        state = state + token  # Update the state (append token)
    store trajectory
- Computing the Advantage
# Compute Advantages (e.g., using GAE)
for each trajectory:
    for t from last token downto first:
        δ_t = reward[t] + γ * V(state[t+1]) - V(state[t])
        A_t = δ_t + γ * λ * A[t+1]  # Recursively compute the advantage
- Updating Phase
# PPO Update Phase: Multiple epochs
for each PPO update epoch:
    for each token data (state s_t, token a_t, logpi_old, A_t) in batch:
        # 1. Compute the log-probability under the current policy
        logpi_current = θ.log_probability(s_t, a_t)
        # 2. Calculate the probability ratio
        r_t = exp( logpi_current - logpi_old )
        # 3. Compute the unclipped and clipped objectives
        loss_unclipped = r_t * A_t
        loss_clipped = clip(r_t, 1-ε, 1+ε) * A_t
        # 4. The token loss is the negative of the minimum of these two values
        loss_token = -min(loss_unclipped, loss_clipped)
    # 5. Average the loss over all tokens and perform a gradient update
    θ = Update(θ, average(loss_token over batch))

# After updating, copy θ to θ_old for the next round of sampling
θ_old = copy(θ)
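The same clipped objective in tensor form, as a hedged sketch: the log-probs and advantages are placeholder tensors, and a real implementation would add the value and entropy terms plus a KL penalty against the reference model.

import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # logp_new / logp_old / advantages: (num_tokens,) flattened over the batch
    ratio = torch.exp(logp_new - logp_old)               # r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # negate: we minimize

# Placeholder usage
logp_new = torch.randn(1024, requires_grad=True)
loss = ppo_clip_loss(logp_new, logp_new.detach() + 0.1 * torch.randn(1024), torch.randn(1024))
loss.backward()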
PPO in practice
- reward shaping (?)
DPO (Direct Preference Optimization)
- off-policy (could be made on-policy by iterating, but people don’t usually do that)
- requires pairwise data (but most data is not inherently in that form)
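The DPO loss in a minimal PyTorch sketch; the sequence-level log-probabilities are assumed to be precomputed and the variable names are illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference model
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (chosen - rejected)), averaged over preference pairs
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Placeholder usage with made-up sequence log-probabilities
policy_c = torch.randn(32, requires_grad=True)
policy_r = torch.randn(32, requires_grad=True)
ref_c, ref_r = torch.randn(32), torch.randn(32)
dpo_loss(policy_c, policy_r, ref_c, ref_r).backward()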
Lecture 16: Alignment - RL 1
Things to watch out for in RLHF
Over-optimization / over-fitting
GRPO
- starts with PPO, but much simpler because the value function / advantage computation is removed
- computes the advantage as the “z-score within group”: $\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}$
- For each input, we sample $G$ rollouts to form a group (see the sketch below)
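A minimal sketch of the group-relative advantage computation; the reward values are placeholders:

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) scalar rewards for the G rollouts of one prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # e.g., verifier 0/1 rewards
print(grpo_advantages(rewards))  # every token in rollout i gets advantage A_i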
Unbiased version: Dr.GRPO
- The length normalization makes model outputs tend to be long (especially when the model cannot generate the correct answer, it makes the output long to dilute the negative reward)
- The standard deviation normalization upweights the samples where std is small, e.g., questions that are too easy (so outputs are all good) or too hard (so outputs are all bad)
- and the scale (dimension) of the objective goes wrong when dividing by the stdev
Case studies
DeepSeek-R1
Opens up a recipe that not only exceeds OpenAI o1, but is also simple
- SFT initialization → RLVR/reasoning RL → SFT/RLHF
A bit overstated
- Output length increasing because of bias introduced by length normalization
- The base model (DeepSeek-V3) already shows the “Aha” moment, so it doesn’t necessarily emerge from RL
Contributions
- Proved that GRPO (outcome-based rewards) works
- PRMs (Process Reward Models) and MCTS (Monte Carlo Tree Search) cannot replicate o1 performance
Kimi K1.5
Recipe: Data curation + SFT → RL
Data curation
- tag questions and balance the distribution
- assign difficulty labels to the dataset and go from easy to difficult
Policy Optimization
Similar to GRPO in terms of:
- reward with reasonable baselines
- Kimi doesn’t use the stdev normalization
- regularization
Length control / penalty
- Compress CoTs for better inference efficiency
- the length-penalty loss term
- when correct, promote short answers and penalize long answers
- when incorrect, only promote answers that are shorter than the average
- the loss is not enabled at the very beginning, to avoid stalling the LLM
- early on the LLM can’t make the answer correct anyway, so it would just keep the answer short to get more reward
RL infra
- Rollouts make inference slow
- separated rollout workers and training workers
- Long CoTs make batches uneven
- etc.
Qwen3
Reasoning RL (Stage 2)
- with only 3995 examples
Thinking Mode Fusion
Allow a single set of model weights to do both thinking and non-thinking
- The thinking process can be halted by using the following sentence:
"Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n"
