For session 2 of the EleutherAI ML Scalability & Performance reading group, I co-presented a talk on the paper “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.”

Another member of the reading group (Ben) presented an overview of the theory, and I presented my Triton kernel implementation of FlashAttention-2.

The code can be found here.
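For context before watching the recording, here is a minimal NumPy sketch of the online-softmax tiling that FlashAttention is built around (single head, forward pass only, no masking). It is an illustrative reference under those simplifying assumptions, not the Triton kernel from the talk; the function name `flash_attention_reference` is mine.

```python
import numpy as np

def flash_attention_reference(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax (FlashAttention-style).

    Q, K, V: (seq_len, head_dim) arrays. Key/value blocks are processed
    sequentially, rescaling partial results as new row maxima appear, so
    the full (seq_len, seq_len) score matrix is never materialized.
    """
    seq_len, head_dim = Q.shape
    scale = 1.0 / np.sqrt(head_dim)

    O = np.zeros_like(Q)             # running (unnormalized) output
    m = np.full(seq_len, -np.inf)    # running row maxima
    l = np.zeros(seq_len)            # running softmax denominators

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                 # scores for this block
        m_new = np.maximum(m, S.max(axis=1))   # updated row maxima
        P = np.exp(S - m_new[:, None])         # block softmax numerators
        correction = np.exp(m - m_new)         # rescale earlier partials

        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                      # final normalization

# Sanity check against naive attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
naive = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), naive, atol=1e-6)
```

The actual kernel adds the pieces the papers focus on: fusing these steps into one GPU kernel, tiling queries as well as keys, and the parallelization and work-partitioning changes introduced in FlashAttention-2.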

Papers:

  1. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  2. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Recording:

ML Scalability & Performance Reading Group Session 2: Flash Attention