PyTorch Conference 2025: MXFP8 Training for MoEs with TorchAO
At PyTorch Conference 2025 I co-presented a talk on PyTorch APIs for High Performance MoE Training and Inference. My part of the talk, titled MXFP8 Training for MoEs with TorchAO, can be viewed on YouTube below. I plan to follow up with a blog post that dives into more detail and tours the code for some of the more interesting Triton and CUDA kernels! In particular, writing the e8m0 scale factors to the blocked layout (a requirement for the tcgen05.mma.* PTX instructions on Blackwell) is a bit unintuitive and could benefit from a more detailed walkthrough.
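As a quick preview: in MXFP8, each block of 32 elements shares a single e8m0 scale (an 8-bit, power-of-two exponent), so an (M, K) tensor carries an (M, K/32) matrix of scales. Blackwell's block-scaled MMA instructions expect those scales regrouped so that each (128 rows x 4 scale columns) tile lands in one contiguous 512-byte chunk. Here's a rough sketch of that reshuffle in pure PyTorch. The helper name is mine, and it assumes the scale matrix is already padded to multiples of (128, 4); treat it as an illustration of the layout rather than the production kernel.

```python
import torch

def to_blocked_layout(scales: torch.Tensor) -> torch.Tensor:
    """Rearrange a (rows, cols) e8m0 scale matrix into the blocked layout,
    where each (128, 4) tile of scales becomes one contiguous 512-byte chunk.

    Sketch only: assumes rows/cols are pre-padded to multiples of (128, 4).
    The shuffle is dtype-agnostic, so it works on the raw uint8 exponent
    bytes just as well as on a float8 scale tensor.
    """
    rows, cols = scales.shape
    assert rows % 128 == 0 and cols % 4 == 0, "pad the scale matrix first"
    # Carve the matrix into (128, 4) tiles, ordered row-block-major.
    tiles = scales.view(rows // 128, 128, cols // 4, 4).permute(0, 2, 1, 3)
    # Within each tile, split the 128 rows into 4 groups of 32, then
    # interleave the groups with the 4 scale columns: each of the 32 rows
    # ends up holding 16 contiguous bytes (4 row-groups x 4 columns).
    tiles = tiles.reshape(-1, 4, 32, 4).transpose(1, 2)
    return tiles.reshape(-1)
```

For example, a (256, 8) scale matrix (i.e., 256 rows of activations with K = 256) comes out as four 512-byte chunks, with the tile for row-block 0 / column-block 0 first. The non-obvious part is exactly that inner group/column interleaving, which is what I'd like to walk through properly in the follow-up post.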
In the meantime, enjoy the talk! (Note: you may need to disable your ad blocker for the YouTube player to render properly. Alternatively, you can watch directly on YouTube here.)
The prototype for MXFP8 MoE training, with documentation, examples, and reproducible benchmarks, can be found here. Feel free to reach out with any questions!