Part 1 — Core Concepts Recap
1.1 The Execution Hierarchy: Threads, Blocks, and Grids
CUDA organizes parallel work in a three-level hierarchy:
Grid
└── Block (up to 1024 threads)
└── Warp (32 threads, the hardware scheduling unit)
└── Thread
- Thread — the finest unit of work; each has its own registers and a unique
threadIdx. - Block — a cooperative group of threads that share
__shared__memory and can synchronize via__syncthreads(). A block runs on a single Streaming Multiprocessor (SM) and never migrates. - Grid — the full set of blocks launched by a kernel call. Blocks in a grid are independent and may execute in any order.
Key implication: inter-block communication during a kernel is not directly supported. Design algorithms so that blocks are independent, or use cooperative groups (covered later).
1.2 Warps and the SIMT Execution Model
The GPU does not schedule individual threads — it schedules warps of 32 consecutive threads. All threads in a warp execute the same instruction at the same time (Single Instruction, Multiple Threads).
Block of 128 threads → 4 warps (warp 0: threads 0–31, warp 1: threads 32–63, …)
Warps are the granularity at which the SM's warp scheduler issues instructions. Understanding warps is essential for reasoning about performance.
Since Volta, no guarantees about warp synchrony
1.3 Occupancy
Occupancy is the ratio of resident warps per SM to the maximum warps an SM can support (64 on Volta and on datacentre Ampere/Hopper/Blackwell, 48 on consumer Ampere/Ada/Blackwell, 32 on Turing).
A resident warp is stalled when it is waiting on a long-latency operation (memory load, math unit result). The SM hides that latency by switching to another ready resident warp. More resident warps → better latency hiding.
Occupancy is limited by whichever resource you exhaust first:
| Resource | Configured by | Effect when over-used |
|---|---|---|
| Registers | Compiler (--maxrregcount) |
Fewer blocks/warps fit per SM |
| Shared memory | __shared__ declarations |
Fewer blocks fit per SM |
| Block size | Launch config <<<grid, block>>> |
Too small = too few warps; too large = register pressure |
| Hardware warp limit | Architecture constant | Hard ceiling |
Tip: Use
nvcc --ptxas-options=-vor Nsight Compute to inspect register and shared memory usage per kernel.
// Query occupancy programmatically
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
1.4 Latency Hiding
Memory latency on a GPU is ~hundreds of cycles. Rather than stalling the pipeline, the SM's warp scheduler issues instructions from other warps while a stalled warp waits.
Cycle 0: Warp A issues global load (goes pending ~400 cycles)
Cycle 1: Warp B issues ALU op ← runs while A waits
Cycle 2: Warp C issues global load
...
Cycle 400: Warp A load completes → warp A becomes ready again
This is latency hiding through warp-level parallelism — the GPU equivalent of out-of-order execution. The implication: you need enough warps in flight to keep the SM busy. Low occupancy = exposed latency = poor utilization.
1.5 Control Divergence
When threads within a warp take different branches, the warp must execute both paths serially, with lanes masked off for the inactive path.
// BAD: threads diverge based on thread ID parity
if (threadIdx.x % 2 == 0) {
doEvenWork(); // half the warp active
} else {
doOddWork(); // other half active ← serialized, not parallel
}
Performance impact: in the worst case (half/half split), throughput is halved for the divergent region. Nested divergence compounds quickly.
Mitigation strategies:
- Restructure data so adjacent threads (same warp) take the same branch.
- Use predicated/branchless arithmetic where the branch body is cheap.
- Accept divergence only at warp boundaries (if (threadIdx.x / 32 == ...) is fine).
1.6 Global Memory Coalescing
Global memory is accessed in 128-byte cache-line transactions. A warp's 32 threads each request a 4-byte float, touching 128 bytes total — a perfect single transaction if the addresses are contiguous.
// Coalesced — 32 consecutive floats, one 128-byte transaction
float val = A[blockDim.x * blockIdx.x + threadIdx.x];
// Strided — each thread jumps by N, multiple transactions
float val = A[(blockDim.x * blockIdx.x + threadIdx.x) * N];
// Random — up to 32 separate transactions (worst case)
float val = A[randomIndex[threadIdx.x]];
Rule of thumb: index global arrays with threadIdx.x varying in the innermost (fastest) dimension. This matches the row-major storage of C arrays and produces coalesced access.
1.7 Demonstration Kernel — SAXPY
SAXPY ("Single-precision A·X Plus Y") is the canonical GPU micro-benchmark:
Y[i] = alpha * X[i] + Y[i]
It is compute-light (2 FLOPs per element) and memory-bound, making it a good test of memory bandwidth and coalescing.
#include <cuda_runtime.h>
#include <cstdio>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>
#define CUDA_CHECK(call) assert((call) == cudaSuccess)
__global__ void saxpy(int n, float alpha, const float* x, float* y) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
y[i] = alpha * x[i] + y[i];
}
void launch_saxpy(int n, float a, float* x, float* y) {
constexpr int BLOCK = 256;
saxpy<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(n, a, x, y);
}
int main() {
const int N = 1 << 25; // 32 M elements
const float A = 2.0f;
const size_t sz = N * sizeof(float);
std::vector<float> hX(N), hY(N);
for (int i = 0; i < N; i++) {
hX[i] = 0.001f * i;
hY[i] = 1.0f;
}
float *dX, *dY;
CUDA_CHECK(cudaMalloc(&dX, sz));
CUDA_CHECK(cudaMalloc(&dY, sz));
CUDA_CHECK(cudaMemcpy(dX, hX.data(), sz, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(dY, hY.data(), sz, cudaMemcpyHostToDevice));
launch_saxpy(N, A, dX, dY);
CUDA_CHECK(cudaGetLastError()); // errors from the launch itself
CUDA_CHECK(cudaDeviceSynchronize()); // errors from kernel execution
CUDA_CHECK(cudaMemcpy(hY.data(), dY, sz, cudaMemcpyDeviceToHost));
for (int i : {0, 1, 12345, N - 1})
std::printf("y[%8d] = %g\n", i, (double)hY[i]);
CUDA_CHECK(cudaFree(dX));
CUDA_CHECK(cudaFree(dY));
return 0;
}
Points to observe in this kernel:
| Feature | Where | What it demonstrates |
|---|---|---|
| Coalesced load | x[i] where i increments with threadIdx.x |
Perfect 128-byte transactions |
| Coalesced store | y[i] same pattern |
No extra transactions |
| Bounds guard | if (i < n) |
Handles non-multiple-of-blockDim N |
| Memory-bound | ~2 FLOPs / 12 bytes | Roofline sits on the memory bandwidth wall |
| Stall Reason | Count | Visualization |
|---|---|---|
| Long Scoreboard | 48 | |
| Short Scoreboard | 3 | |
| Wait | 3 | |
| [OTHER] | 4 |
Now it is your turn. SAXPY only reads and writes — the next step up is a kernel where the threads have to cooperate: summing a whole array down to a single number with a shared-memory tree reduction.
1.8 Exercise — Parallel Reduction
Sum an array of floats into a single value with a shared-memory reduction.
In this exercise you implement the standard shared-memory tree reduction: each block cooperatively reduces a chunk of the input to one partial sum. You only write the per-block reduction — the harness adds the per-block partials together for you. This is the textbook baseline, and a good first exercise in getting threads to cooperate through shared memory.
Interface
The function to be implemented is:
void reduce(int n, const float* in, float* out);
Here n is the number of elements in in (an arbitrary positive integer). Each block
reduces its own chunk of the input to a single partial sum and writes it to
out[blockIdx.x] — one slot per block. The harness then sums those partials on the host
to get the grand total, so you do not need atomics or a second pass.
out is large enough to hold one partial per block, and is zero-filled before each run.
Pick any block size that is a multiple of the warp size (32); a block size that does not
divide n is fine, as long as you launch ⌈n / blockDim⌉ blocks to cover every element.
The result is checked against a float64 reference with a tolerance that scales with the size and magnitude of the input, so an honest fp32 reduction will pass comfortably.
Background: parallel reduction
A reduction collapses n values into one with an associative operator (here, +). On a
GPU we do it as a tree: a block loads its slice into shared memory, then repeatedly has
half the threads add a partner's value until a single value remains. With B threads per
block the tree finishes in log2(B) steps instead of B.
The reduction is memory-bound: it reads 4n bytes and writes only one float per
block, so the 2 FLOPs of addition per element are dwarfed by the load traffic. The number
to watch is DRAM bandwidth — how close you get to reading the input at the hardware's peak
rate.
Hints
※ Writing one partial per block
Block blockIdx.x owns the input slice
in[blockIdx.x * blockDim.x ... blockIdx.x * blockDim.x + blockDim.x). After the in-block
tree leaves the block's total in sdata[0], thread 0 writes it out:
if (tid == 0) out[blockIdx.x] = sdata[0];
That is the whole cross-block story — no atomics, no second kernel. The host sums the
⌈n / blockDim⌉ partials.
※ Interleaved vs. sequential addressing
The most natural-looking tree uses interleaved addressing — if (tid % (2*s) == 0) with
s doubling each step. It works, but the active threads are spread across the warp, so
whole warps run with only a few lanes doing useful work (control divergence, §1.5), and
the modulo costs extra instructions.
Keeping the active threads contiguous instead — for (s = blockDim.x/2; s > 0; s >>= 1)
with if (tid < s) — lets whole warps retire together. Same number of adds, much less
divergence. Try the obvious version first, profile it, then fix it.
※ Don't forget the tail
n is not necessarily a multiple of the block size. Threads whose global index is >= n
must contribute the identity (0.0f) rather than reading past the end of the array.
You can test your solution locally:
python run.py test