Posts
- An illustrated deep-dive into how the compute and comms in TP+SP are overlapped using Async TP
- An illustrated deep-dive into Megatron-style tensor parallelism
- Reducing Activation Recomputation in Large Transformer Models
- Megatron-LM
- DeepSeek V3
- Zero Bubble Pipeline Parallelism
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks