An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks
For session 1 of the EleutherAI ML Scalability & Performance reading group, I gave a presentation covering GPU architecture, CUDA, NCCL, and common ML performance bottlenecks.
We didn’t discuss any research papers in this session; the goal was to cover the prerequisite knowledge needed to get the most out of research papers on ML scalability and performance.
Recording: