Part 1 — Core Concepts Recap

1.1 The Execution Hierarchy: Threads, Blocks, and Grids

CUDA organizes parallel work in a three-level hierarchy:

Grid
 └── Block (up to 1024 threads)
      └── Warp (32 threads, the hardware scheduling unit)
           └── Thread

Key implication: inter-block communication during a kernel is not directly supported. Design algorithms so that blocks are independent, or use cooperative groups (covered later).


1.2 Warps and the SIMT Execution Model

The GPU does not schedule individual threads — it schedules warps of 32 consecutive threads. All threads in a warp execute the same instruction at the same time (Single Instruction, Multiple Threads).

Block of 128 threads  →  4 warps (warp 0: threads 0–31, warp 1: threads 32–63, …)

Warps are the granularity at which the SM's warp scheduler issues instructions. Understanding warps is essential for reasoning about performance.

Since Volta, no guarantees about warp synchrony


1.3 Occupancy

Occupancy is the ratio of resident warps per SM to the maximum warps an SM can support (64 on Volta and on datacentre Ampere/Hopper/Blackwell, 48 on consumer Ampere/Ada/Blackwell, 32 on Turing).

A resident warp is stalled when it is waiting on a long-latency operation (memory load, math unit result). The SM hides that latency by switching to another ready resident warp. More resident warps → better latency hiding.

Occupancy is limited by whichever resource you exhaust first:

Resource Configured by Effect when over-used
Registers Compiler (--maxrregcount) Fewer blocks/warps fit per SM
Shared memory __shared__ declarations Fewer blocks fit per SM
Block size Launch config <<<grid, block>>> Too small = too few warps; too large = register pressure
Hardware warp limit Architecture constant Hard ceiling

Tip: Use nvcc --ptxas-options=-v or Nsight Compute to inspect register and shared memory usage per kernel.

// Query occupancy programmatically
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

1.4 Latency Hiding

Memory latency on a GPU is ~hundreds of cycles. Rather than stalling the pipeline, the SM's warp scheduler issues instructions from other warps while a stalled warp waits.

Cycle 0:  Warp A issues global load  (goes pending ~400 cycles)
Cycle 1:  Warp B issues ALU op       ← runs while A waits
Cycle 2:  Warp C issues global load
...
Cycle 400: Warp A load completes → warp A becomes ready again

This is latency hiding through warp-level parallelism — the GPU equivalent of out-of-order execution. The implication: you need enough warps in flight to keep the SM busy. Low occupancy = exposed latency = poor utilization.


1.5 Control Divergence

When threads within a warp take different branches, the warp must execute both paths serially, with lanes masked off for the inactive path.

// BAD: threads diverge based on thread ID parity
if (threadIdx.x % 2 == 0) {
    doEvenWork();   // half the warp active
} else {
    doOddWork();    // other half active  ← serialized, not parallel
}

Performance impact: in the worst case (half/half split), throughput is halved for the divergent region. Nested divergence compounds quickly.

Mitigation strategies: - Restructure data so adjacent threads (same warp) take the same branch. - Use predicated/branchless arithmetic where the branch body is cheap. - Accept divergence only at warp boundaries (if (threadIdx.x / 32 == ...) is fine).


1.6 Global Memory Coalescing

Global memory is accessed in 128-byte cache-line transactions. A warp's 32 threads each request a 4-byte float, touching 128 bytes total — a perfect single transaction if the addresses are contiguous.

// Coalesced — 32 consecutive floats, one 128-byte transaction
float val = A[blockDim.x * blockIdx.x + threadIdx.x];

// Strided — each thread jumps by N, multiple transactions
float val = A[(blockDim.x * blockIdx.x + threadIdx.x) * N];

// Random — up to 32 separate transactions (worst case)
float val = A[randomIndex[threadIdx.x]];

Rule of thumb: index global arrays with threadIdx.x varying in the innermost (fastest) dimension. This matches the row-major storage of C arrays and produces coalesced access.


1.7 Demonstration Kernel — SAXPY

SAXPY ("Single-precision A·X Plus Y") is the canonical GPU micro-benchmark:

Y[i] = alpha * X[i] + Y[i]

It is compute-light (2 FLOPs per element) and memory-bound, making it a good test of memory bandwidth and coalescing.

#include <cuda_runtime.h>
#include <cstdio>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

#define CUDA_CHECK(call) assert((call) == cudaSuccess)

__global__ void saxpy(int n, float alpha, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

void launch_saxpy(int n, float a, float* x, float* y) {
    constexpr int BLOCK = 256;
    saxpy<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(n, a, x, y);
}

int main() {
    const int N     = 1 << 25;   // 32 M elements
    const float A   = 2.0f;
    const size_t sz = N * sizeof(float);

    std::vector<float> hX(N), hY(N);
    for (int i = 0; i < N; i++) {
        hX[i] = 0.001f * i;
        hY[i] = 1.0f;
    }

    float *dX, *dY;
    CUDA_CHECK(cudaMalloc(&dX, sz));
    CUDA_CHECK(cudaMalloc(&dY, sz));
    CUDA_CHECK(cudaMemcpy(dX, hX.data(), sz, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dY, hY.data(), sz, cudaMemcpyHostToDevice));

    launch_saxpy(N, A, dX, dY);
    CUDA_CHECK(cudaGetLastError());      // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize()); // errors from kernel execution

    CUDA_CHECK(cudaMemcpy(hY.data(), dY, sz, cudaMemcpyDeviceToHost));
    for (int i : {0, 1, 12345, N - 1})
        std::printf("y[%8d] = %g\n", i, (double)hY[i]);

    CUDA_CHECK(cudaFree(dX));
    CUDA_CHECK(cudaFree(dY));
    return 0;
}

Points to observe in this kernel:

Feature Where What it demonstrates
Coalesced load x[i] where i increments with threadIdx.x Perfect 128-byte transactions
Coalesced store y[i] same pattern No extra transactions
Bounds guard if (i < n) Handles non-multiple-of-blockDim N
Memory-bound ~2 FLOPs / 12 bytes Roofline sits on the memory bandwidth wall
Stall ReasonCountVisualization
Long Scoreboard 48
Short Scoreboard 3
Wait 3
[OTHER]4

Now it is your turn. SAXPY only reads and writes — the next step up is a kernel where the threads have to cooperate: summing a whole array down to a single number with a shared-memory tree reduction.

1.8 Exercise — Parallel Reduction

Sum an array of floats into a single value with a shared-memory reduction.

In this exercise you implement the standard shared-memory tree reduction: each block cooperatively reduces a chunk of the input to one partial sum. You only write the per-block reduction — the harness adds the per-block partials together for you. This is the textbook baseline, and a good first exercise in getting threads to cooperate through shared memory.

Interface

The function to be implemented is:

void reduce(int n, const float* in, float* out);

Here n is the number of elements in in (an arbitrary positive integer). Each block reduces its own chunk of the input to a single partial sum and writes it to out[blockIdx.x] — one slot per block. The harness then sums those partials on the host to get the grand total, so you do not need atomics or a second pass.

out is large enough to hold one partial per block, and is zero-filled before each run. Pick any block size that is a multiple of the warp size (32); a block size that does not divide n is fine, as long as you launch ⌈n / blockDim⌉ blocks to cover every element.

The result is checked against a float64 reference with a tolerance that scales with the size and magnitude of the input, so an honest fp32 reduction will pass comfortably.

Background: parallel reduction

A reduction collapses n values into one with an associative operator (here, +). On a GPU we do it as a tree: a block loads its slice into shared memory, then repeatedly has half the threads add a partner's value until a single value remains. With B threads per block the tree finishes in log2(B) steps instead of B.

The reduction is memory-bound: it reads 4n bytes and writes only one float per block, so the 2 FLOPs of addition per element are dwarfed by the load traffic. The number to watch is DRAM bandwidth — how close you get to reading the input at the hardware's peak rate.

Hints

※ Writing one partial per block

Block blockIdx.x owns the input slice in[blockIdx.x * blockDim.x ... blockIdx.x * blockDim.x + blockDim.x). After the in-block tree leaves the block's total in sdata[0], thread 0 writes it out:

if (tid == 0) out[blockIdx.x] = sdata[0];

That is the whole cross-block story — no atomics, no second kernel. The host sums the ⌈n / blockDim⌉ partials.

※ Interleaved vs. sequential addressing

The most natural-looking tree uses interleaved addressing — if (tid % (2*s) == 0) with s doubling each step. It works, but the active threads are spread across the warp, so whole warps run with only a few lanes doing useful work (control divergence, §1.5), and the modulo costs extra instructions.

Keeping the active threads contiguous instead — for (s = blockDim.x/2; s > 0; s >>= 1) with if (tid < s) — lets whole warps retire together. Same number of adds, much less divergence. Try the obvious version first, profile it, then fix it.

※ Don't forget the tail

n is not necessarily a multiple of the block size. Threads whose global index is >= n must contribute the identity (0.0f) rather than reading past the end of the array.

You can test your solution locally:

python run.py test