Posts
- An illustrated deep-dive into how the compute and comms in TP+SP are overlapped using Async TP
- An illustrated deep-dive into Megatron-style tensor parallelism
- Reducing Activation Recomputation in Large Transformer Models
- Megatron-LM
- DeepSeek V3
- Zero Bubble Pipeline Parallelism
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks