ML Perf Notes
Blog
Posts
Feb 2, 2026
Debugging deadlocks in warp-specialized GEMM kernels with CUDA-GDB
May 26, 2025
An illustrated deep-dive into how the compute and comms in TP+SP are overlapped using Async TP
Mar 30, 2025
An illustrated deep-dive into Megatron-style tensor parallelism