For session 2 of the EleutherAI ML Scalability & Performance reading group, I co-presented a talk on the paper “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.”

Another member of the reading group (Ben) presented an overview of the theory, and I presented my Triton kernel implementation of FlashAttention-2.

The code can be found here.
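For context before watching the recording, here is a minimal NumPy sketch of the online-softmax tiling that FlashAttention is built around (single head, forward pass only, no masking). It is an illustrative reference under those simplifying assumptions, not the Triton kernel from the talk; the function name `flash_attention_reference` is mine.

```python
import numpy as np

def flash_attention_reference(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax (FlashAttention-style).

    Q, K, V: (seq_len, head_dim) arrays. Key/value blocks are processed
    sequentially, rescaling partial results as new row maxima appear, so
    the full (seq_len, seq_len) score matrix is never materialized.
    """
    seq_len, head_dim = Q.shape
    scale = 1.0 / np.sqrt(head_dim)

    O = np.zeros_like(Q)             # running (unnormalized) output
    m = np.full(seq_len, -np.inf)    # running row maxima
    l = np.zeros(seq_len)            # running softmax denominators

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                 # scores for this block
        m_new = np.maximum(m, S.max(axis=1))   # updated row maxima
        P = np.exp(S - m_new[:, None])         # block softmax numerators
        correction = np.exp(m - m_new)         # rescale earlier partials

        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                      # final normalization

# Sanity check against naive attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), naive, atol=1e-6)
```

The actual kernel adds the pieces the papers focus on: fusing these steps into one GPU kernel, tiling queries as well as keys, and the parallelization and work-partitioning changes introduced in FlashAttention-2.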

Papers:

  1. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  2. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Recording:

ML Scalability & Performance Reading Group Session 2: Flash Attention