Part 2 — Thread Coarsening and Vectorized Memory Access

In this section, we will look at how to improve the performance of memory-bound kernels by reducing the number of overhead instructions and by using larger (vectorized) access sizes.

Consider the the baseline SAXPY kernel as the standard example of an embarrassingly parallel GPU workload. When running a naive implementation like the one from the introduction on a "consumer" GPU like the RTX Pro 5000, everything looks great. But running the exact same code on a datacentre B200, suddenly there is severe underutilization of the available memory bandwidth.

Metric RTX Pro 5000 B200
Instructions 19,922,944 19,922,944
Compute util 16.75% 41.13%
DRAM util 92.87% 36.15%
DRAM BW 1.11 TB/s 2.41 TB/s

On datacentre GPUs, memory bandwidth is much larger, yet CUDA core compute is similar to non-datacentre models. Thus, the kernel does run twice as fast on the B200 compared to the RTX 5000 Pro, with twice the compute utilization, but the available bandwidth on the B200 is far more than double that of the RTX.

To use the B200 to its full potential, we need to adapt our code!

2.1 Per-thread overhead

In the saxpy baseline, each thread does exactly 2 FLOPs and 3 memory operations. Recall from Part 1 that the warp is the unit at which instructions are issued — and it is also the unit at which the profiler counts them. For an input size of $N=2^{25} = 33,554,432$ elements, we get $1,048,576$ warps and expect:

Let's look at the generated assembly:

.visible .entry saxpy(
    .param .u32 saxpy_param_0,
    .param .f32 saxpy_param_1,
    .param .u64 saxpy_param_2,
    .param .u64 saxpy_param_3
)
{
    .reg .pred  %p<2>;
    .reg .f32   %f<5>;
    .reg .b32   %r<6>;
    .reg .b64   %rd<8>;

// __global__ void saxpy(int n, float alpha, const float* x, float* y) {


    ld.param.u32    %r2, [saxpy_param_0];
    ld.param.f32    %f1, [saxpy_param_1];
    ld.param.u64    %rd1, [saxpy_param_2];
    ld.param.u64    %rd2, [saxpy_param_3];

// int i = blockIdx.x * blockDim.x + threadIdx.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r4, %ntid.x;
    mov.u32     %r5, %tid.x;
    mad.lo.s32  %r1, %r3, %r4, %r5;

// if (i < n)
    setp.ge.s32     %p1, %r1, %r2;
    @%p1 bra    $L__BB0_2;


// int i = blockIdx.x * blockDim.x + threadIdx.x;
    cvta.to.global.u64  %rd3, %rd2;
    cvta.to.global.u64  %rd4, %rd1;

// y[i] = alpha * x[i] + y[i];
    mul.wide.s32    %rd5, %r1, 4;
    add.s64     %rd6, %rd4, %rd5;
    ld.global.f32   %f2, [%rd6];
    add.s64     %rd7, %rd3, %rd5;
    ld.global.f32   %f3, [%rd7];
    fma.rn.f32  %f4, %f2, %f1, %f3;
    st.global.f32   [%rd7], %f4;

$L__BB0_2:

// }
    ret;

}

We see a lot of additional things:

Most of these instructions are overhead and not part of the kernel's actual work. Profiling this kernel with ncu gives the following table of executed instructions:

OpcodeCountVisualization
LDC4,194,304
LDCU3,145,728
IMAD3,145,728
LDG2,097,152
EXIT2,097,152
FFMA1,048,576
STG1,048,576
S2R1,048,576
ISETP1,048,576
S2UR1,048,576

The green bars show the actual counts, while the blue overlays show our theoretical "useful" work. See below for a detailed breakdown of where these instructions are coming from.

AdvancedInstruction breakdown

The GPU-level instructions (SASS) do not correspond one-to-one with the PTX virtual instructions. However, for a kernel this simple we can easily reconstruct the mapping.

Let's start with the easy identifications:

  • S2R and S2UR are the Special to (Uniform) Register instructions, responsible for loading thread and block indices. We get exactly two per thread, as expected.
  • IMAD handles Integer Multiplication and Addition, and is used to calculate the target index i, as well as the source addresses x + i and y + i. Note that in bytes, x + i is actually x + 4*i. The SASS compiler fuses the multiplication by 4 and addition into a single instruction, even with the PTX source doing individual operations. We get three per thread, as expected.
  • LDG and STG are the Load and Store from/to Global instructions, with 2 loads and 1 store per thread.
  • FFMA is the Fused Floating Multiply-Add, handling alpha * x[i] + y[i]. Only a single useful arithmetic instruction is issued per thread.
  • ISETP evaluates the condition in the if statement, issuing one instruction per thread.

Now for the less direct translations:

  • LDC and LDCU are Load from Constant memory to register or uniform register. These are used to load the kernel parameters. We expect a total of four of these, but get four LDC and three LDCU. One additional load comes from fetching blockDim.x, and one more because a base pointer for the global memory descriptor is loaded. The last one is the stack pointer: the ABI reserves R1 for it, and ptxas initializes it in the kernel prologue even though this kernel never spills or calls functions.
  • EXIT corresponds to the return statement in the kernel. At first glance, the numbers seem confusing, because profiling tells us that we had twice as many exits as threads. However, this just means that the assembler generated an early exit as a predicated EXIT instruction. This instruction gets issued, but all of the participating threads are masked out from actual execution by predication. Then they hit a second EXIT at the end of the kernel, and this time they stop the thread. Note that because predication is used, we do not see any BRA (Branch) instructions at all.
      LDC R1, c[0x0][0x37c]                     // stack pointer (ABI reserves R1)
      S2R R0, SR_TID.X                          // threadIdx.x
      S2UR UR4, SR_CTAID.X                      // blockIdx.x (uniform)
      LDCU UR5, c[0x0][0x380]                   // n (uniform)
      LDC R7, c[0x0][0x360]                     // blockDim.x
      IMAD R7, R7, UR4, R0                      // threadIdx.x + blockIdx.x * blockDim.x
      ISETP.GE.AND P0, PT, R7, UR5, PT          // P0 = (i >= n)
@P0   EXIT                                      // early return if out of bounds
      LDC.64 R2, c[0x0][0x388]                  // x (base pointer)
      LDCU.64 UR4, c[0x0][0x358]                // global memory base pointer
      LDCU UR6, c[0x0][0x384]                   // alpha (uniform)
      LDC.64 R4, c[0x0][0x390]                  // y (base pointer)
      IMAD.WIDE R2, R7, 0x4, R2                 // x + 4*i (byte offset)
      LDG.E R2, desc[UR4][R2.64]                // R2 = x[i]
      IMAD.WIDE R4, R7, 0x4, R4                 // y + 4*i (byte offset)
      LDG.E R7, desc[UR4][R4.64]                // R7 = y[i]
      FFMA R7, R2, UR6, R7                      // R7 = alpha * x[i] + y[i]
      STG.E desc[UR4][R4.64], R7                // y[i] = R7
      EXIT                                      // return

2.2 Bytes in flight

In addition to the instruction overhead we just discussed, there is a second aspect to be considered when writing memory-bound kernels:

Think of the path from the SMs to DRAM as a long pipe. It has high latency — any single byte takes a while to traverse it — but, especially on datacentre GPUs, also high bandwidth. The only way to use the full bandwidth is to keep the pipe full: to always have enough requests outstanding that data is arriving every cycle. This concept is known as having enough bytes in flight.

This is Little's Law. If the memory system has latency L and we keep B bytes in flight on average, the bandwidth we realise is BW = B / L.

So how much can the naive kernel keep in flight? At 100% occupancy each SM of a B200 holds 64 resident warps. Each warp issues two independent global loads (x[i] and y[i]) before it needs either result, and each load moves 32×4 = 128 bytes: 64 × 2 × 128 = 16 KiB per SM, or about 2368 KiB across all 148 SMs. B200's global-memory latency is roughly 428 ns according to chipsandcheese, so the most this configuration could ever achieve is 2368 KiB / 428 ns ≈ 5.66 TB/s.

So far we have only counted the loads, but SAXPY also writes: one STG per element stores the result back to y[i]. As stores are fire-and-forget, they typically do not introduce any warp stalls of their own. With one store for every two reads, this leads to a maximum bandwidth of 8.49 TB/s, above the maximum available hardware bandwidth.

Does that mean this consideration is meaningless for our current setup? Not quite: The number of 8.49 TB/s assumes a constant occupancy of 100%, with all warps permanently waiting on their outstanding loads while the previous block's writes are still ongoing. While we have a theoretical occupancy of 100%, scheduling overheads and imbalances lead to an actual occupancy of only 76.04%. Furthermore, as we saw in the previous section, each warp spends a considerable amount of time not waiting on its outstanding loads.

As such, we can still expect that improving bytes in flight will help us achieve higher bandwidth. To see how to achieve that, we can reframe Little's law as

       (requests/warp in flight) × (active warps) × (bytes per request)
BW  =  ────────────────────────────────────────────────────────────────
                                      latency

This gives us four potential ways of increasing the achieved bandwidth:


2.3 Thread coarsening

The inefficiency we want to fix here is that each thread does a lot of setup, only to do just four useful instructions. A natural solution is to keep that same thread around for longer and reuse the setup multiple times. In that case, each thread processes multiple elements; we have coarsened the granularity of the element-to-thread mapping.

You may have encountered a similar strategy already, just from a slightly different perspective. In computations where different data elements need to be combined in multiple ways (e.g., matmul), having each thread handle multiple outputs is beneficial because it facilitates data reuse, see, e.g., Kernel 2 in the PPC course.

Even though there is no payload data to be reused here, there is still implicit data reuse by amortizing the loads of kernel parameters and special registers.

How should we map threads to elements? Assume we want to handle 4 outputs per thread.

✦ Solution

A naïve mapping in which thread i handles elements i*4, i*4+1, i*4+2, i*4+3 reduces the number of instructions, but wreaks havoc on the memory subsystem, as each load instruction now requests a strided access pattern per warp.

Thus, we need to move the stride to warp or block level. For a block of 256 threads, we can have thread i handle elements i, i + 256, i + 512, i + 768. Then each warp still accesses 32 consecutive elements.

Let's apply the strategy above to the saxpy kernel. With EL_PER_THREAD=4 elements per thread, we need to adjust the kernel launcher to

void launch_saxpy(int n, float alpha, float* x, float* y) {
    constexpr int BLOCK = 256;
    // Each block covers BLOCK*EL_PER_THREAD elements, so shrink the grid accordingly.
    const int grid = (n + BLOCK * EL_PER_THREAD - 1) / (BLOCK * EL_PER_THREAD);
    saxpy_v2<<<grid, BLOCK>>>(n, alpha, x, y);
}

The main kernel implements the block-strided access pattern:

__global__ void saxpy_v2(int n, float alpha,
                         const float* x, float* y) {
    // Base index for this thread; stride by blockDim.x so consecutive threads
    // always access consecutive addresses (coalescing preserved).
    int base   = blockIdx.x * blockDim.x * EL_PER_THREAD + threadIdx.x;
    int stride = blockDim.x;

    #pragma unroll
    for (int k = 0; k < EL_PER_THREAD; ++k) {
        int i = base + k * stride;
        if (i < n)
            y[i] = alpha * x[i] + y[i];
    }
}

Comparing with the original kernel, we see a marked improvement on the B200:

Metric V1 V2
Instructions 19,922,944 14,680,064
Compute util 41.13% 33.65%
DRAM BW 36.15% 45.55%
DRAM Read 256.1 MiB 256.0 MiB
DRAM Write 89.4 MiB 90.0 MiB
Cycles 173,914 138,173
Duration 150.6 µs 119.7 µs

Further reductions in the number of instructions are possible by lifting the if statement out of the hot loop. One could, for example, test once before the loop whether 4 elements are available. If so, a loop without any conditions can be used, otherwise, the same loop as above is run. This is demonstrated below

__global__ void saxpy_v3a(int n, float alpha,
                         const float* x, float* y) {
    // Base index for this thread; stride by blockDim.x so consecutive threads
    // always access consecutive addresses (coalescing preserved).
    int base   = blockIdx.x * blockDim.x * EL_PER_THREAD + threadIdx.x;
    int stride = blockDim.x;

    if (base + (EL_PER_THREAD - 1) * stride < n) {
        // Fast path: all EL_PER_THREAD elements are in bounds,
        // so the per-element guard can be dropped.
        #pragma unroll
        for (int k = 0; k < EL_PER_THREAD; ++k) {
            int i = base + k * stride;
            y[i] = alpha * x[i] + y[i];
        }
    } else {
        // Tail: same guarded loop as v2.
        #pragma unroll
        for (int k = 0; k < EL_PER_THREAD; ++k) {
            int i = base + k * stride;
            if (i < n)
                y[i] = alpha * x[i] + y[i];
        }
    }
}

While this results in another considerable reduction in executed instructions, the effect on the running time is minimal. The cost of these instructions was mostly hidden behind the memory latency anyway. We also see that these optimizations did indeed come at the cost of slightly increased register pressure.

Metric V1 V2 V3
Instructions 19,922,944 14,680,064 11,796,480
Compute util 41.13% 33.65% 17.06%
DRAM BW 36.15% 45.55% 46.16%
DRAM Read 256.1 MiB 256.0 MiB 256.0 MiB
DRAM Write 89.4 MiB 90.0 MiB 89.5 MiB
Cycles 173,914 138,173 136,177
Duration 150.6 µs 119.7 µs 117.9 µs
Registers 16 18 22

Below, you'll find some more detailed profiling results:

📊 Instructions

Except for the IADD3 instruction, most likely responsible for address calculations, the top contributors to instruction count are now the FFMA and LDG instructions that perform the actual desired work.

Opcode Count Visualization
IADD3 3,145,728
LDG 2,097,152
FFMA 1,048,576
STG 1,048,576
IMAD 1,048,576
LDC 1,048,576
LDCU 786,432
EXIT 262,144
SHF 262,144
S2R 262,144
ISETP 262,144
[OTHER] 524,288

2.4 __launch_bounds__, __restrict__ and #pragma unroll

Before optimizing further, let us consider three very cheap ways to help the compiler generate better code by adding one-line annotations.

__restrict__

Coarsening gave each thread four elements, and with them eight independent loads — exactly the kind of work Section 2.2 said we need in order to keep more bytes in flight. But that only pays off if the compiler actually issues those loads early, before it stores anything back. By default it will not, because it cannot prove that x and y point to disjoint memory. If the two were to overlap, a store to y[j] could change a value that a later load of x reads, so the compiler must keep the loads and stores in program order — which leaves only a couple of loads outstanding at any moment.

We can remove that doubt by promising that the pointers never alias. Marking them __restrict__:

__global__ void saxpy_v3b(int n, float alpha,
                          const float* __restrict__ x, float* __restrict__ y)

tells the compiler that nothing written through y is ever read through x.

This is a promise the compiler trusts but does not check: if the pointers really do alias at runtime, the behaviour is undefined.

※ For saxpy that costs us almost nothing, since nearly every aliasing call is already invalid for other reasons

Does that impose new constraints on the caller, the way assuming n % 4 == 0 did? Not really. Calling saxpy(n, alpha, x, x+1) is already undefined behaviour: thread 0 would write to a location read by thread 1, and without fences that is a race condition. But saxpy(n, alpha, x, x+256) would have been valid, because there the read/write overlap stays within a single thread. So __restrict__ only outlaws cases like the latter. This is also what makes the analysis hard for the compiler by default: with no promise to lean on, it must assume the legal-but-aliasing case is possible and emit conservative code.

As the table below shows, applying restrict does give slightly better code generation. Despite the same number of instructions, we can extract 1-2% more memory bandwidth.

Metric V3a V3b
Instructions 11,796,480 11,796,480
Compute util 17.06% 17.64%
DRAM BW 46.16% 47.84%
Cycles 136,177 131,364
Duration 117.9 µs 113.8 µs
Registers 22 22

__launch_bounds__

A second dilemma that the compiler faces is that it generally can't know a priori which trade-off to make: Use many registers to better hide the memory latency within a single thread, or minimize the number of registers to increase the occupancy and hide the memory latency by giving the scheduler more warps to choose from. To make matters worse, choosing the first option can lead to cases in which certain block sizes are not executable anymore at all, because their number of required registers exceeds the 65k that are available to each SM.

In many cases each kernel is only called for one or a few different block sizes. Using the __launch_bounds__(MAX_BLOCK_SIZE) attribute, we can inform the compiler that we promise never to launch the kernel with a number of threads per block larger than MAX_BLOCK_SIZE. If this size is smaller than the maximum block size for the device, then this allows the compiler to distribute the registers more liberally, enabling better thread-level latency hiding.

In the extreme case, this could lead to only a single, small block fitting on an SM, which also may be undesirable. For that reason, launch_bounds accepts a second argument, which is the minimum number of blocks that must fit concurrently. This serves to specify a lower bound on occupancy that the compiler must guarantee. For details, see the CUDA guide.

#pragma unroll

If you look at high-performance cuda code that you can find on the internet, you'll often find #pragma unroll sprinkled all over the code. The idea is that you want to force the compiler to turn the loop, which has a fixed number of iterations, into a sequence of instructions that is repeated verbatim in the generated assembly, thus avoiding the cost of loop instructions.

While generally harmless, in most cases these #pragma unroll are effectively a no-op: The compiler would have unrolled the loop even without a pragma. This is because, in addition to the loop instructions themselves, there is a second cost associated with most loops. Consider

LoopUnrolled
for(int i = 0; i < 4; ++i) {
  a[i] = i;
}
//
a[0] = 0;
a[1] = 1;
a[2] = 2;
a[3] = 3;

In the runtime loop version, a is accessed with a dynamic index. Consequently, the runtime version cannot keep a in registers, and must go to local memory for each access. Not only is access even to L1 cache easily 10x slower than reading a register, but because of the way cache coherency is implemented in the GPU, L1 caches are write-through, meaning every write to a must go all the way to the L2 cache, causing needless traffic. Because of the tremendous cost associated with not unrolling such a loop, the compiler will almost certainly unroll it, pragma or no pragma.

However, in some cases unrolling a loop would result in a very large number of instructions, and the compiler might decide to prioritize code size over loop unrolling. In such a scenario, a well-placed #pragma unroll can help speed up the produced code significantly.

AdvancedThe effect of __restrict__ at the assembly level

Concretely, the reordering we are after looks like this at the PTX level:

// load x
ld.global.f32   %f1, [%rd1+0];
ld.global.f32   %f2, [%rd1+1024];
ld.global.f32   %f3, [%rd1+2048];
ld.global.f32   %f4, [%rd1+3072];
// load y
ld.global.f32   %f5, [%rd2+0];
ld.global.f32   %f6, [%rd2+1024];
ld.global.f32   %f7, [%rd2+2048];
ld.global.f32   %f8, [%rd2+3072];
// compute
fma.rn.f32  %f10, %f9, %f1, %f5;
fma.rn.f32  %f11, %f9, %f2, %f6;
fma.rn.f32  %f12, %f9, %f3, %f7;
fma.rn.f32  %f13, %f9, %f4, %f8;
// store
st.global.f32 [%rd2+0],    %f10;
st.global.f32 [%rd2+1024], %f11;
st.global.f32 [%rd2+2048], %f12;
st.global.f32 [%rd2+3072], %f13;

Let's compare with the real kernels, which also need instructions to compute addresses and other housekeeping. Without __restrict__, we get the expected sequence of two loads followed by one store:

mul.wide.s32    %rd14, %r3, 4;
add.s64     %rd15, %rd2, %rd14;
ld.global.f32   %f11, [%rd15];
add.s64     %rd16, %rd1, %rd14;
ld.global.f32   %f12, [%rd16];
fma.rn.f32  %f13, %f1, %f11, %f12;
st.global.f32   [%rd16], %f13;
cvt.u64.u32     %rd17, %r2;
add.s64     %rd18, %rd15, %rd17;
ld.global.f32   %f14, [%rd18];
add.s64     %rd19, %rd16, %rd17;
ld.global.f32   %f15, [%rd19];
fma.rn.f32  %f16, %f1, %f14, %f15;
st.global.f32   [%rd19], %f16;
add.s64     %rd20, %rd18, %rd17;
ld.global.f32   %f17, [%rd20];
add.s64     %rd21, %rd19, %rd17;
ld.global.f32   %f18, [%rd21];
fma.rn.f32  %f19, %f1, %f17, %f18;
st.global.f32   [%rd21], %f19;
add.s64     %rd22, %rd20, %rd17;
ld.global.f32   %f20, [%rd22];
add.s64     %rd23, %rd21, %rd17;
ld.global.f32   %f21, [%rd23];
fma.rn.f32  %f22, %f1, %f20, %f21;
st.global.f32   [%rd23], %f22;

With __restrict__, we also get two loads followed by one store, so it seems the qualifier didn't have any effect?

mul.wide.s32    %rd14, %r3, 4;
add.s64     %rd15, %rd2, %rd14;
ld.global.nc.f32    %f11, [%rd15];
add.s64     %rd16, %rd1, %rd14;
ld.global.f32   %f12, [%rd16];
fma.rn.f32  %f13, %f1, %f11, %f12;
st.global.f32   [%rd16], %f13;
cvt.u64.u32     %rd17, %r2;
add.s64     %rd18, %rd15, %rd17;
ld.global.nc.f32    %f14, [%rd18];
add.s64     %rd19, %rd16, %rd17;
ld.global.f32   %f15, [%rd19];
fma.rn.f32  %f16, %f1, %f14, %f15;
st.global.f32   [%rd19], %f16;
add.s64     %rd20, %rd18, %rd17;
ld.global.nc.f32    %f17, [%rd20];
add.s64     %rd21, %rd19, %rd17;
ld.global.f32   %f18, [%rd21];
fma.rn.f32  %f19, %f1, %f17, %f18;
st.global.f32   [%rd21], %f19;
add.s64     %rd22, %rd20, %rd17;
ld.global.nc.f32    %f20, [%rd22];
add.s64     %rd23, %rd21, %rd17;
ld.global.f32   %f21, [%rd23];
fma.rn.f32  %f22, %f1, %f20, %f21;
st.global.f32   [%rd23], %f22;

However, looking closely, we do see that the assembly changed. Load instructions for the x array gained an additional .nc suffix. This suffix indicates that these loads can be performed non-coherent. No other thread is allowed to write to this memory location, so all the hardware mechanisms that are normally employed to ensure that the different threads can see a "consistent" view of the memory are disabled.

Interestingly, when compiling with nvcc 13.3 (the course material uses cuda 13.0), there is indeed a reordering of instructions:

ld.global.nc.f32  %f11, [%rd3];
ld.global.f32     %f12, [%rd4];
fma.rn.f32        %f13, %f11, %f1, %f12;
st.global.f32     [%rd4], %f13;
mul.wide.s32      %rd14, %r1, 4;
add.s64           %rd15, %rd3, %rd14;
ld.global.nc.f32  %f14, [%rd15];
add.s64           %rd16, %rd4, %rd14;
ld.global.f32     %f15, [%rd16];
fma.rn.f32        %f16, %f14, %f1, %f15;
ld.global.nc.f32  %f17, [%rd5];           // third load between two stores
st.global.f32     [%rd16], %f16;
ld.global.f32     %f18, [%rd6];
fma.rn.f32        %f19, %f17, %f1, %f18;
mul.wide.s32      %rd17, %r3, 4;
add.s64           %rd18, %rd3, %rd17;
ld.global.nc.f32  %f20, [%rd18];
add.s64           %rd19, %rd4, %rd17;
st.global.f32     [%rd6], %f19;
ld.global.f32     %f21, [%rd19];          // only one load between two stores
fma.rn.f32        %f22, %f20, %f1, %f21;
st.global.f32     [%rd19], %f22;
bra.uni     $L__BB2_8;

2.5 Persistent Kernels

Even with the coarsening introduced above, we have not changed the fundamental decomposition of the problem: Each thread handles a fixed amount of work, and the total problem size is controlled by adjusting the number of blocks that get launched.

We can go further and make the amount of threads fixed, with each thread performing more work as the problem size grows. The number of threads is chosen just so that the maximum possible occupancy is reached, that is, so that the GPU handles exactly one wave of threads. As individual threads persist from beginning to end of the kernel run, these kernels are called persistent kernels.

One way of implementing persistent kernels is via a grid-strided loop: The out-of-bounds check becomes the loop condition, and each iteration increments the index by the total number of threads in the grid.

For the saxpy kernel, it looks like this:

__global__ void saxpy_persistent(int n, float alpha,
                                 const float* __restrict__ x,
                                       float* __restrict__ y) {
    int stride = gridDim.x * blockDim.x;
    int base   = blockIdx.x * blockDim.x + threadIdx.x;

    for (int base_i = base; base_i < n; base_i += stride * COARSE) {
        #pragma unroll
        for (int k = 0; k < COARSE; ++k) {
            int i = base_i + k * stride;
            if (i < n)
                y[i] = alpha * x[i] + y[i];
        }
    }
}

We see another improvement over the existing previous version:

Metric V1 V2 V3 V4
Instructions 19,922,944 14,680,064 11,796,480 12,398,848
Compute util 41.13% 33.65% 17.64% 38.39%
DRAM BW 36.15% 45.55% 47.84% 55.38%
DRAM Read 256.1 MiB 256.0 MiB 256.0 MiB 256.0 MiB
DRAM Write 89.4 MiB 90.0 MiB 89.5 MiB 89.0 MiB
Cycles 173,914 138,173 131,364 113,318
Duration 150.6 µs 119.7 µs 113.8 µs 98.2 µs

Generally, persistent kernels are useful when the setup or teardown code per block is expensive, such as if large constants have to be fetched or if there are reductions. On the flip side, a persistent kernel removes the ability of the GPU scheduler to do work balancing. If one block is slower for some reason (possibly even just being scheduled later because of work on a different stream), there is no way to use free resources on other SMs to rebalance the workload. As such, careful consideration should be given to whether the kernel is suitable for persistent execution.


2.6 Vectorized load and store operations

Even in the ideal case of infinite coarsening, only a tiny fraction of instructions would be useful arithmetic. For every FFMA, there are at least two loads, one store, and two address calculations. Can we do better?

This is where vectorized load and store operations come in. With those, a single thread can load up to 16 bytes (up to 32 on Blackwell) with a single instruction. Correspondingly, the full warp reads or writes a total of 512 bytes.

CUDA provides dedicated data types for vectors of up to four elements comprised of built-in types. These are declared in vector_types.h, and follow the naming convention of type name concatenated with number of elements, that is, int2 would be a vector of two ints, and float4 a vector of four floats. The individual elements can be accessed via .x, .y, .z, and .w.

Applying this to the earlier kernel version 3, we have to replace the loop with explicit access to the individual vector elements:

__global__ void saxpy_v5(int n, float alpha,
                         const float4* __restrict__ x, float4* __restrict__ y)
{
    // Base index for this thread; stride by blockDim.x so consecutive threads
    // always access consecutive addresses (coalescing preserved).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 a = x[i];
        float4 b = y[i];
        b.x = alpha * a.x + b.x;
        b.y = alpha * a.y + b.y;
        b.z = alpha * a.z + b.z;
        b.w = alpha * a.w + b.w;
        y[i] = b;
    }
}

Importantly, this changes the thread-to-element mapping yet again, back to the natural ordering. Even though the stride between threads is now 16 bytes, as each thread reads the entire vector, this is still a coalesced access.

If we peek at the assembly, we see that the compiler now generates a single ld.global.v4.f32 instruction for the entire vector:

.visible .entry saxpy_v5(
    .param .u32 saxpy_v5_param_0,
    .param .f32 saxpy_v5_param_1,
    .param .u64 saxpy_v5_param_2,
    .param .u64 saxpy_v5_param_3
)
{
    .reg .pred  %p<2>;
    .reg .f32   %f<22>;
    .reg .b32   %r<6>;
    .reg .b64   %rd<8>;

// __global__ void saxpy_v5(int n, float alpha,


    ld.param.u32    %r2, [saxpy_v5_param_0];
    ld.param.f32    %f1, [saxpy_v5_param_1];
    ld.param.u64    %rd1, [saxpy_v5_param_2];
    ld.param.u64    %rd2, [saxpy_v5_param_3];

// // Base index for this thread; stride by blockDim.x so consecutive threads
// // always access consecutive addresses (coalescing preserved).
// int i = blockIdx.x * blockDim.x + threadIdx.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r4, %ntid.x;
    mov.u32     %r5, %tid.x;
    mad.lo.s32  %r1, %r3, %r4, %r5;

// if (i < n) {
    setp.ge.s32     %p1, %r1, %r2;
    @%p1 bra    $L__BB0_2;


// int i = blockIdx.x * blockDim.x + threadIdx.x;
    cvta.to.global.u64  %rd3, %rd2;
    cvta.to.global.u64  %rd4, %rd1;

// float4 a = x[i];
    mul.wide.s32    %rd5, %r1, 16;
    add.s64     %rd6, %rd4, %rd5;
    ld.global.nc.v4.f32     {%f2, %f3, %f4, %f5}, [%rd6];

// float4 b = y[i];
    add.s64     %rd7, %rd3, %rd5;
    ld.global.v4.f32    {%f10, %f11, %f12, %f13}, [%rd7];

// b.y = alpha * a.y + b.y;
// b.z = alpha * a.z + b.z;
// b.w = alpha * a.w + b.w;
    fma.rn.f32  %f18, %f5, %f1, %f13;

// b.z = alpha * a.z + b.z;
    fma.rn.f32  %f19, %f4, %f1, %f12;

// b.x = alpha * a.x + b.x;
// b.y = alpha * a.y + b.y;
    fma.rn.f32  %f20, %f3, %f1, %f11;

// b.x = alpha * a.x + b.x;
    fma.rn.f32  %f21, %f2, %f1, %f10;

// y[i] = b;
    st.global.v4.f32    [%rd7], {%f21, %f20, %f19, %f18};

$L__BB0_2:

// }
// }
    ret;

}

There is one important condition for using vectorized load and store operations: The source/target address must be aligned to the size of the vector type, i.e. float4 requires 16-byte alignment. Otherwise, the attempt will lead to a runtime crash. Fortunately, cudaMalloc guarantees at least 256-byte alignment, and in many practical cases with regular data structures, alignment can be typically assumed. Still, it is at least advisable to put an alignment check into the kernel launcher, so that the asynchronous GPU crash can be turned into a proper program error.

With this improvement, we are finally getting almost 80% of the peak performance:

Metric V1 V2 V3 V4 V5
Instructions 19,922,944 14,680,064 11,796,480 12,398,848 5,767,168
Compute util 41.13% 33.65% 17.06% 38.39% 22.60%
DRAM BW 36.15% 45.55% 46.16% 55.38% 79.06%
DRAM Read 256.1 MiB 256.0 MiB 256.0 MiB 256.0 MiB 256.0 MiB
DRAM Write 89.4 MiB 90.0 MiB 89.5 MiB 89.0 MiB 86.0 MiB
Cycles 173,914 138,173 136,177 113,318 78,565
Duration 150.6 µs 119.7 µs 117.9 µs 98.2 µs 68.1 µs

Of course, vectorized loads could also be combined with further coarsening, so that multiple vectors are handled by each thread. This is left as an exercise to the reader and can be explored as part of the bfloat16 axpy kernel below.

📊 Instructions

Not only do we reduce the number of LDG and STG by a factor of four, but we also only need one fourth of the address calculations, so IMAD is also reduced.

Opcode Count Visualization
FFMA 1,048,576
LDC 1,048,576
LDCU 786,432
IMAD 786,432
LDG 524,288
EXIT 524,288
S2R 262,144
ISETP 262,144
S2UR 262,144
[OTHER] 262,144

Note: In some situations, the compiler is able to automatically vectorize memory accesses if it can be certain that alignment requirements are met, such as when accessing static shared memory or when using hints like __builtin_assume_aligned. As we have seen in the examples, however, vectorized memory access typically goes together with a change in access pattern, which is not something the compiler can do automatically.

Under rare circumstances it is also possible that the compiler uses scalar memory loads even though the code is written in vectorized form.


Note: Starting with Blackwell, the GPU supports 256-bit vector loads and stores for global memory (but not for shared memory). There is no nice C interface for it, you'll need to write PTX.


2.7 Exercise — AXPY with bfloat16

Implement "alpha times x plus y" for brainfloat16 inputs and outputs.

In this warmup exercise, implement an axpy kernel with the bfloat16 type instead of standard single precision. As this data type is only 16 bits wide, vectorized memory access is especially important.

Interface

The function to be implemented is:

void haxpy(int n, float alpha, const nv_bfloat16* x, nv_bfloat16* y);

Here, n is the number of elements in x and y, and alpha is a finite scalar. The number of elements can be an arbitrary positive integer, and the starting addresses of the device-memory arrays x and y are only guaranteed to align to the requirement of nv_bfloat16, that is, to two bytes. If you use vectorized loads, you need to handle alignment yourself. However, you may assume that both arrays are offset by the same amount, i.e., ((uintptr_t)x % 256 == (uintptr_t)y % 256).

Background: bfloat16

bfloat16 (BF16) is one of two standard 16-bit floating-point formats. It uses the same 8-bit exponent as float32 (same dynamic range) but only 7 mantissa bits instead of 23 (lower precision). The bfloat16 data type has been available since the Ampere architecture (A100, RTX 30xx), and is provided under the name __nv_bfloat16 in cuda_bf16.h. By now, it has been established as the de-facto standard for deep learning; it is half as big as float32, enjoys native tensor core support and, contrary to half (FP16), its dynamic range means it does not require complicated loss scaling techniques to function properly.

Hints

※ Vectorization and BF16

The only vector type available in CUDA is __nv_bfloat162, which is just two BF16 values for a total of 4 bytes. To use larger vectorized loads, we have three options: 1) Define our own bf16x8 type that is 16-byte aligned and has 8 elements. We can just follow the pattern with which the cuda headers define their vector types. 2) Use reinterpret_cast to handle all the data transfers in a natively supported vector type. 3) Use __builtin_assume_aligned to signal the compiler that alignment can be assumed, and hope that it transforms an unrolled loop into a vectorized access.

Custom type

struct __align__(16) bf16x8 {
    __nv_bfloat162 lo2;   // elements 0–1
    __nv_bfloat162 ml2;   // elements 2–3
    __nv_bfloat162 mh2;   // elements 4–5
    __nv_bfloat162 hi2;   // elements 6–7
};

Casting

nv_bfloat16 dst[16];
reinterpret_cast<int4*>(dst) = *reinterpret_cast<const int4*>(src);

You can test your solution locally:

python run.py test