EleutherAI ML Perf Reading Group: An intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks

For session 1 of the EleutherAI ML Scalability & Performance reading group, I gave a presentation covering GPU architecture, CUDA, NCCL, and common ML performance bottlenecks.

We didn’t discuss any research papers in this session; the goal was to cover the prerequisite knowledge needed to get the most out of reading research papers on ML scalability and performance.

Note: you may have to disable your ad blocker for the YouTube player to render correctly. Alternatively, you can watch the recording directly on YouTube here.