Posts
An illustrated deep-dive into Megatron-style tensor parallelism
Reducing Activation Recomputation in Large Transformer Models
Megatron-LM
DeepSeek V3
Zero Bubble Pipeline Parallelism
Ring Attention with Blockwise Transformers for Near-Infinite Context
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks