Advanced CUDA Programming Course
Welcome to the Advanced CUDA Programming Course. This course covers high-performance kernel development for modern NVIDIA GPUs, from core concepts to the latest features of the Ampere, Hopper, and Blackwell architectures.
Course Outline
Part 1 — Core Concepts Recap
- 1.1 The Execution Hierarchy: Threads, Blocks, and Grids
- 1.2 Warps and the SIMT Execution Model
- 1.3 Occupancy
- 1.4 Latency Hiding
- 1.5 Control Divergence
- 1.6 Global Memory Coalescing
- 1.7 Demonstration Kernel — SAXPY
- 1.8 Exercise — Parallel Reduction
Part 2 — Thread Coarsening and Vectorized Memory Access
- 2.1 Per-thread overhead
- † Instruction breakdown
- 2.2 Bytes in flight
- 2.3 Thread coarsening
- 2.4 __launch_bounds__, __restrict__ and #pragma unroll
- † The effect of restrict at the assembly level
- 2.5 Persistent Kernels
- 2.6 Vectorized load and store operations
Part 3 — Warp Shuffles, Reductions, and Cooperative Groups
- 3.1 Reduction: from Shared Memory to Warp Shuffles
- † Floating-point reductions
- 3.2 Masks, __activemask, and Why It Is Not Enough
- 3.3 Cooperative Groups
Part 4 — Asynchronous Data Movement: LDGSTS
- 4.1 The Problem with Synchronous Loads
- 4.2 Asynchronous Copy to Shared Memory
- 4.3 Software Pipelining
- 4.4 Putting It Together — Pipelined SAXPY
- † L1 Bypass