ML Perf Notes

Posts

  • May 26, 2025

    An illustrated deep-dive into how the compute and comms in TP+SP are overlapped using Async TP

  • Mar 30, 2025

    An illustrated deep-dive into Megatron-style tensor parallelism

  • Mar 23, 2025

    Reducing Activation Recomputation in Large Transformer Models

  • Mar 9, 2025

    Megatron-LM

  • Feb 23, 2025

    DeepSeek V3

  • Feb 9, 2025

    Zero Bubble Pipeline Parallelism

  • Jan 12, 2025

    Ring Attention with Blockwise Transformers for Near-Infinite Context

  • Dec 14, 2024

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  • Nov 30, 2024

    An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks

  • danielvegamyhre
  • daniel-vega-myhre-a799aba2
  • vega_myhre