Part 2 — Thread Coarsening and Vectorized Memory Access
In this section, we will look at how to improve the performance of memory-bound kernels by reducing the number of overhead instructions and by using larger (vectorized) access sizes.
Consider the the baseline SAXPY kernel as the standard example of an embarrassingly parallel GPU workload. When running a naive implementation like the one from the introduction on a "consumer" GPU like the RTX Pro 5000, everything looks great. But running the exact same code on a datacentre B200, suddenly there is severe underutilization of the available memory bandwidth.
| Metric | RTX Pro 5000 | B200 |
|---|---|---|
| Instructions | 19,922,944 | 19,922,944 |
| Compute util | 16.75% | 41.13% |
| DRAM util | 92.87% | 36.15% |
| DRAM BW | 1.11 TB/s | 2.41 TB/s |
On datacentre GPUs, memory bandwidth is much larger, yet CUDA core compute is similar to non-datacentre models. Thus, the kernel does run twice as fast on the B200 compared to the RTX 5000 Pro, with twice the compute utilization, but the available bandwidth on the B200 is far more than double that of the RTX.
To use the B200 to its full potential, we need to adapt our code!
2.1 Per-thread overhead
In the saxpy baseline, each thread does exactly 2 FLOPs and 3 memory operations. Recall from Part 1 that the warp is the unit at which instructions are issued — and it is also the unit at which the profiler counts them. For an input size of $N=2^{25} = 33,554,432$ elements, we get $1,048,576$ warps and expect:
- $2,097,152$
LDG(Load Global) instructions to fetchx[i]andy[i](2 per warp) - $1,048,576$
FFMA(Fused Multiply-Add) instructions to computey[i] = alpha * x[i] + y[i](1 per warp) - $1,048,576$
STG(Store Global) instructions to storey[i](1 per warp)
Let's look at the generated assembly:
.visible .entry saxpy(
.param .u32 saxpy_param_0,
.param .f32 saxpy_param_1,
.param .u64 saxpy_param_2,
.param .u64 saxpy_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<5>;
.reg .b32 %r<6>;
.reg .b64 %rd<8>;
// __global__ void saxpy(int n, float alpha, const float* x, float* y) {
ld.param.u32 %r2, [saxpy_param_0];
ld.param.f32 %f1, [saxpy_param_1];
ld.param.u64 %rd1, [saxpy_param_2];
ld.param.u64 %rd2, [saxpy_param_3];
// int i = blockIdx.x * blockDim.x + threadIdx.x;
mov.u32 %r3, %ctaid.x;
mov.u32 %r4, %ntid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
// if (i < n)
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2;
// int i = blockIdx.x * blockDim.x + threadIdx.x;
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
// y[i] = alpha * x[i] + y[i];
mul.wide.s32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
ld.global.f32 %f2, [%rd6];
add.s64 %rd7, %rd3, %rd5;
ld.global.f32 %f3, [%rd7];
fma.rn.f32 %f4, %f2, %f1, %f3;
st.global.f32 [%rd7], %f4;
$L__BB0_2:
// }
ret;
}
We see a lot of additional things:
- Four
ld.paraminstructions to load the kernel parameters. - Three
mov.u32to fetch block and thread info, andmad.lo.s32calculates the target indexi. setp.ge.s32sets a predicate and is followed by a predicated branch- The
cvta.to.globalinstructions ensure that the pointersxandypoint to global memory space. - Finally, the last block mixes address calculation (
mul.wide.s32andadd.s64) followed by loadingld.global, the actual arithmetic (fma), and storingst.global.
Most of these instructions are overhead and not part of the kernel's actual work. Profiling this kernel with ncu gives the following table of executed instructions:
| Opcode | Count | Visualization |
|---|---|---|
| LDC | 4,194,304 | |
| LDCU | 3,145,728 | |
| IMAD | 3,145,728 | |
| LDG | 2,097,152 | |
| EXIT | 2,097,152 | |
| FFMA | 1,048,576 | |
| STG | 1,048,576 | |
| S2R | 1,048,576 | |
| ISETP | 1,048,576 | |
| S2UR | 1,048,576 |
The green bars show the actual counts, while the blue overlays show our theoretical "useful" work. See below for a detailed breakdown of where these instructions are coming from.
AdvancedInstruction breakdown
The GPU-level instructions (SASS) do not correspond one-to-one with the PTX virtual instructions. However, for a kernel this simple we can easily reconstruct the mapping.
Let's start with the easy identifications:
S2RandS2URare the Special to (Uniform) Register instructions, responsible for loading thread and block indices. We get exactly two per thread, as expected.IMADhandles Integer Multiplication and Addition, and is used to calculate the target indexi, as well as the source addressesx + iandy + i. Note that in bytes,x + iis actuallyx + 4*i. The SASS compiler fuses the multiplication by 4 and addition into a single instruction, even with the PTX source doing individual operations. We get three per thread, as expected.LDGandSTGare the Load and Store from/to Global instructions, with 2 loads and 1 store per thread.FFMAis the Fused Floating Multiply-Add, handlingalpha * x[i] + y[i]. Only a single useful arithmetic instruction is issued per thread.ISETPevaluates the condition in theifstatement, issuing one instruction per thread.
Now for the less direct translations:
LDCandLDCUare Load from Constant memory to register or uniform register. These are used to load the kernel parameters. We expect a total of four of these, but get fourLDCand threeLDCU. One additional load comes from fetchingblockDim.x, and one more because a base pointer for the global memory descriptor is loaded. The last one is the stack pointer: the ABI reservesR1for it, and ptxas initializes it in the kernel prologue even though this kernel never spills or calls functions.EXITcorresponds to thereturnstatement in the kernel. At first glance, the numbers seem confusing, because profiling tells us that we had twice as many exits as threads. However, this just means that the assembler generated an early exit as a predicatedEXITinstruction. This instruction gets issued, but all of the participating threads are masked out from actual execution by predication. Then they hit a secondEXITat the end of the kernel, and this time they stop the thread. Note that because predication is used, we do not see anyBRA(Branch) instructions at all.
LDC R1, c[0x0][0x37c] // stack pointer (ABI reserves R1)
S2R R0, SR_TID.X // threadIdx.x
S2UR UR4, SR_CTAID.X // blockIdx.x (uniform)
LDCU UR5, c[0x0][0x380] // n (uniform)
LDC R7, c[0x0][0x360] // blockDim.x
IMAD R7, R7, UR4, R0 // threadIdx.x + blockIdx.x * blockDim.x
ISETP.GE.AND P0, PT, R7, UR5, PT // P0 = (i >= n)
@P0 EXIT // early return if out of bounds
LDC.64 R2, c[0x0][0x388] // x (base pointer)
LDCU.64 UR4, c[0x0][0x358] // global memory base pointer
LDCU UR6, c[0x0][0x384] // alpha (uniform)
LDC.64 R4, c[0x0][0x390] // y (base pointer)
IMAD.WIDE R2, R7, 0x4, R2 // x + 4*i (byte offset)
LDG.E R2, desc[UR4][R2.64] // R2 = x[i]
IMAD.WIDE R4, R7, 0x4, R4 // y + 4*i (byte offset)
LDG.E R7, desc[UR4][R4.64] // R7 = y[i]
FFMA R7, R2, UR6, R7 // R7 = alpha * x[i] + y[i]
STG.E desc[UR4][R4.64], R7 // y[i] = R7
EXIT // return
2.2 Bytes in flight
In addition to the instruction overhead we just discussed, there is a second aspect to be considered when writing memory-bound kernels:
Think of the path from the SMs to DRAM as a long pipe. It has high latency — any single byte takes a while to traverse it — but, especially on datacentre GPUs, also high bandwidth. The only way to use the full bandwidth is to keep the pipe full: to always have enough requests outstanding that data is arriving every cycle. This concept is known as having enough bytes in flight.
This is Little's Law. If the memory system has latency L and we keep B bytes in flight on average, the bandwidth we realise is BW = B / L.
So how much can the naive kernel keep in flight? At 100% occupancy each SM of a B200 holds 64 resident warps.
Each warp issues two independent global loads (x[i] and y[i]) before it needs either result, and each load moves 32×4 = 128 bytes: 64 × 2 × 128 = 16 KiB per SM, or about 2368 KiB across all 148 SMs.
B200's global-memory latency is roughly 428 ns according to chipsandcheese,
so the most this configuration could ever achieve is 2368 KiB / 428 ns ≈ 5.66 TB/s.
So far we have only counted the loads, but SAXPY also writes: one STG per element stores the result back to y[i].
As stores are fire-and-forget, they typically do not introduce any warp stalls of their own. With one store for every two reads,
this leads to a maximum bandwidth of 8.49 TB/s, above the maximum available hardware bandwidth.
Does that mean this consideration is meaningless for our current setup? Not quite: The number of 8.49 TB/s assumes
a constant occupancy of 100%, with all warps permanently waiting on their outstanding loads while the previous block's
writes are still ongoing. While we have a theoretical occupancy of 100%, scheduling overheads and imbalances lead to an actual
occupancy of only 76.04%.
Furthermore, as we saw in the previous section, each warp spends a considerable amount of time not waiting on its outstanding loads.
As such, we can still expect that improving bytes in flight will help us achieve higher bandwidth. To see how to achieve that, we can reframe Little's law as
(requests/warp in flight) × (active warps) × (bytes per request)
BW = ────────────────────────────────────────────────────────────────
latency
This gives us four potential ways of increasing the achieved bandwidth:
- Decrease latency. As the access latency of memory is approximately fixed, the only way to achieve this is by making sure data is read from a lower-latency memory, i.e., lower-level caches. For computations in which the same data elements are reused multiple times, a carefully chosen memory access pattern or explicit caching in shared memory might achieve this. But in a streaming kernel that touches each data element exactly once, latency is essentially constant.
- Increase occupancy More concurrent warps support more memory requests in flight to keep the memory subsystem busy. Increasing occupancy generally requires reducing the resource requirements of each warp,
e.g., using fewer registers or less shared memory per block. In case of the
saxpykernel, we already have full occupancy, so this alone is not sufficient. - More requests per warp If each warp ensures that it has multiple outstanding loads in flight, the total number of bytes in flight increases. Typically, this requires thread coarsening as discussed in Section 2.3, and leads to more register pressure, so it can be at odds with occupancy.
- Move more bytes per request. If each request reads or writes more data, then the same number of requests can support higher bandwidth. In that case, each thread receives more data, thus a prerequisite for this change is thread coarsening.
2.3 Thread coarsening
The inefficiency we want to fix here is that each thread does a lot of setup, only to do just four useful instructions. A natural solution is to keep that same thread around for longer and reuse the setup multiple times. In that case, each thread processes multiple elements; we have coarsened the granularity of the element-to-thread mapping.
You may have encountered a similar strategy already, just from a slightly different perspective. In computations where different data elements need to be combined in multiple ways (e.g., matmul), having each thread handle multiple outputs is beneficial because it facilitates data reuse, see, e.g., Kernel 2 in the PPC course.
Even though there is no payload data to be reused here, there is still implicit data reuse by amortizing the loads of kernel parameters and special registers.
How should we map threads to elements? Assume we want to handle 4 outputs per thread.
✦ Solution
A naïve mapping in which thread i handles elements i*4, i*4+1, i*4+2, i*4+3 reduces the number of instructions,
but wreaks havoc on the memory subsystem, as each load instruction now requests a strided access pattern per warp.
Thus, we need to move the stride to warp or block level. For a block of 256 threads, we can have thread i handle
elements i, i + 256, i + 512, i + 768. Then each warp still accesses 32 consecutive elements.
Let's apply the strategy above to the saxpy kernel. With EL_PER_THREAD=4 elements per thread,
we need to adjust the kernel launcher to
void launch_saxpy(int n, float alpha, float* x, float* y) {
constexpr int BLOCK = 256;
// Each block covers BLOCK*EL_PER_THREAD elements, so shrink the grid accordingly.
const int grid = (n + BLOCK * EL_PER_THREAD - 1) / (BLOCK * EL_PER_THREAD);
saxpy_v2<<<grid, BLOCK>>>(n, alpha, x, y);
}
The main kernel implements the block-strided access pattern:
__global__ void saxpy_v2(int n, float alpha,
const float* x, float* y) {
// Base index for this thread; stride by blockDim.x so consecutive threads
// always access consecutive addresses (coalescing preserved).
int base = blockIdx.x * blockDim.x * EL_PER_THREAD + threadIdx.x;
int stride = blockDim.x;
#pragma unroll
for (int k = 0; k < EL_PER_THREAD; ++k) {
int i = base + k * stride;
if (i < n)
y[i] = alpha * x[i] + y[i];
}
}
Comparing with the original kernel, we see a marked improvement on the B200:
| Metric | V1 | V2 |
|---|---|---|
| Instructions | 19,922,944 | 14,680,064 |
| Compute util | 41.13% | 33.65% |
| DRAM BW | 36.15% | 45.55% |
| DRAM Read | 256.1 MiB | 256.0 MiB |
| DRAM Write | 89.4 MiB | 90.0 MiB |
| Cycles | 173,914 | 138,173 |
| Duration | 150.6 µs | 119.7 µs |
Further reductions in the number of instructions are possible by lifting the if statement out of the hot loop. One could,
for example, test once before the loop whether 4 elements are available. If so, a loop without any conditions can be used,
otherwise, the same loop as above is run. This is demonstrated below
__global__ void saxpy_v3a(int n, float alpha,
const float* x, float* y) {
// Base index for this thread; stride by blockDim.x so consecutive threads
// always access consecutive addresses (coalescing preserved).
int base = blockIdx.x * blockDim.x * EL_PER_THREAD + threadIdx.x;
int stride = blockDim.x;
if (base + (EL_PER_THREAD - 1) * stride < n) {
// Fast path: all EL_PER_THREAD elements are in bounds,
// so the per-element guard can be dropped.
#pragma unroll
for (int k = 0; k < EL_PER_THREAD; ++k) {
int i = base + k * stride;
y[i] = alpha * x[i] + y[i];
}
} else {
// Tail: same guarded loop as v2.
#pragma unroll
for (int k = 0; k < EL_PER_THREAD; ++k) {
int i = base + k * stride;
if (i < n)
y[i] = alpha * x[i] + y[i];
}
}
}
While this results in another considerable reduction in executed instructions, the effect on the running time is minimal. The cost of these instructions was mostly hidden behind the memory latency anyway. We also see that these optimizations did indeed come at the cost of slightly increased register pressure.
| Metric | V1 | V2 | V3 |
|---|---|---|---|
| Instructions | 19,922,944 | 14,680,064 | 11,796,480 |
| Compute util | 41.13% | 33.65% | 17.06% |
| DRAM BW | 36.15% | 45.55% | 46.16% |
| DRAM Read | 256.1 MiB | 256.0 MiB | 256.0 MiB |
| DRAM Write | 89.4 MiB | 90.0 MiB | 89.5 MiB |
| Cycles | 173,914 | 138,173 | 136,177 |
| Duration | 150.6 µs | 119.7 µs | 117.9 µs |
| Registers | 16 | 18 | 22 |
Below, you'll find some more detailed profiling results:
📊 Instructions
Except for the IADD3 instruction, most likely responsible for address calculations, the top contributors to instruction count
are now the FFMA and LDG instructions that perform the actual desired work.
| Opcode | Count | Visualization |
|---|---|---|
| IADD3 | 3,145,728 | |
| LDG | 2,097,152 | |
| FFMA | 1,048,576 | |
| STG | 1,048,576 | |
| IMAD | 1,048,576 | |
| LDC | 1,048,576 | |
| LDCU | 786,432 | |
| EXIT | 262,144 | |
| SHF | 262,144 | |
| S2R | 262,144 | |
| ISETP | 262,144 | |
| [OTHER] | 524,288 |
2.4 __launch_bounds__, __restrict__ and #pragma unroll
Before optimizing further, let us consider three very cheap ways to help the compiler generate better code by adding one-line annotations.
__restrict__
Coarsening gave each thread four elements, and with them eight independent loads — exactly the kind of work Section 2.2 said we need in order to keep more bytes in flight.
But that only pays off if the compiler actually issues those loads early, before it stores anything back. By default it will not, because it cannot prove that x and y point to disjoint memory.
If the two were to overlap, a store to y[j] could change a value that a later load of x reads, so the compiler must keep the loads and stores in program order —
which leaves only a couple of loads outstanding at any moment.
We can remove that doubt by promising that the pointers never alias. Marking them __restrict__:
__global__ void saxpy_v3b(int n, float alpha,
const float* __restrict__ x, float* __restrict__ y)
tells the compiler that nothing written through y is ever read through x.
This is a promise the compiler trusts but does not check: if the pointers really do alias at runtime, the behaviour is undefined.
※ For saxpy that costs us almost nothing, since nearly every aliasing call is already invalid for other reasons
Does that impose new constraints on the caller, the way assuming n % 4 == 0 did? Not really. Calling saxpy(n, alpha, x, x+1) is already undefined behaviour:
thread 0 would write to a location read by thread 1, and without fences that is a race condition.
But saxpy(n, alpha, x, x+256) would have been valid, because there the read/write overlap stays within a single thread. So __restrict__ only outlaws cases like the latter.
This is also what makes the analysis hard for the compiler by default: with no promise to lean on,
it must assume the legal-but-aliasing case is possible and emit conservative code.
As the table below shows, applying restrict does give slightly better code generation. Despite the same number of instructions, we can extract 1-2% more memory bandwidth.
| Metric | V3a | V3b |
|---|---|---|
| Instructions | 11,796,480 | 11,796,480 |
| Compute util | 17.06% | 17.64% |
| DRAM BW | 46.16% | 47.84% |
| Cycles | 136,177 | 131,364 |
| Duration | 117.9 µs | 113.8 µs |
| Registers | 22 | 22 |
__launch_bounds__
A second dilemma that the compiler faces is that it generally can't know a priori which trade-off to make: Use many registers to better hide the memory latency within a single thread, or minimize the number of registers to increase the occupancy and hide the memory latency by giving the scheduler more warps to choose from. To make matters worse, choosing the first option can lead to cases in which certain block sizes are not executable anymore at all, because their number of required registers exceeds the 65k that are available to each SM.
In many cases each kernel is only called for one or a few different block sizes. Using the __launch_bounds__(MAX_BLOCK_SIZE) attribute, we can inform the compiler that
we promise never to launch the kernel with a number of threads per block larger than MAX_BLOCK_SIZE. If this size is smaller than the maximum block size for the device, then this allows
the compiler to distribute the registers more liberally, enabling better thread-level latency hiding.
In the extreme case, this could lead to only a single, small block fitting on an SM, which also may be undesirable. For that reason,
launch_bounds accepts a second argument, which is the minimum number of blocks that must fit concurrently. This serves to specify a
lower bound on occupancy that the compiler must guarantee.
For details, see the CUDA guide.
#pragma unroll
If you look at high-performance cuda code that you can find on the internet, you'll often find #pragma unroll sprinkled all over the code.
The idea is that you want to force the compiler to turn the loop, which has a fixed number of iterations, into a sequence of instructions that is repeated verbatim in
the generated assembly, thus avoiding the cost of loop instructions.
While generally harmless, in most cases these #pragma unroll are effectively a no-op: The compiler would have unrolled the loop even without a pragma.
This is because, in addition to the loop instructions themselves, there is a second cost associated with most loops. Consider
| Loop | Unrolled |
|---|---|
|
|
In the runtime loop version, a is accessed with a dynamic index. Consequently, the runtime version cannot keep a in registers,
and must go to local memory for each access. Not only is access even to L1 cache easily 10x slower than reading a register, but
because of the way cache coherency is implemented in the GPU, L1 caches are write-through, meaning every write to a must go all the way to the L2 cache,
causing needless traffic.
Because of the tremendous cost associated with not unrolling such a loop, the compiler will almost certainly unroll it, pragma or no pragma.
However, in some cases unrolling a loop would result in a very large number of instructions, and the compiler might decide to prioritize code size over loop unrolling.
In such a scenario, a well-placed #pragma unroll can help speed up the produced code significantly.
AdvancedThe effect of __restrict__ at the assembly level
Concretely, the reordering we are after looks like this at the PTX level:
// load x
ld.global.f32 %f1, [%rd1+0];
ld.global.f32 %f2, [%rd1+1024];
ld.global.f32 %f3, [%rd1+2048];
ld.global.f32 %f4, [%rd1+3072];
// load y
ld.global.f32 %f5, [%rd2+0];
ld.global.f32 %f6, [%rd2+1024];
ld.global.f32 %f7, [%rd2+2048];
ld.global.f32 %f8, [%rd2+3072];
// compute
fma.rn.f32 %f10, %f9, %f1, %f5;
fma.rn.f32 %f11, %f9, %f2, %f6;
fma.rn.f32 %f12, %f9, %f3, %f7;
fma.rn.f32 %f13, %f9, %f4, %f8;
// store
st.global.f32 [%rd2+0], %f10;
st.global.f32 [%rd2+1024], %f11;
st.global.f32 [%rd2+2048], %f12;
st.global.f32 [%rd2+3072], %f13;
Let's compare with the real kernels, which also need instructions to compute addresses and other housekeeping.
Without __restrict__, we get the expected sequence of two loads followed by one store:
mul.wide.s32 %rd14, %r3, 4;
add.s64 %rd15, %rd2, %rd14;
ld.global.f32 %f11, [%rd15];
add.s64 %rd16, %rd1, %rd14;
ld.global.f32 %f12, [%rd16];
fma.rn.f32 %f13, %f1, %f11, %f12;
st.global.f32 [%rd16], %f13;
cvt.u64.u32 %rd17, %r2;
add.s64 %rd18, %rd15, %rd17;
ld.global.f32 %f14, [%rd18];
add.s64 %rd19, %rd16, %rd17;
ld.global.f32 %f15, [%rd19];
fma.rn.f32 %f16, %f1, %f14, %f15;
st.global.f32 [%rd19], %f16;
add.s64 %rd20, %rd18, %rd17;
ld.global.f32 %f17, [%rd20];
add.s64 %rd21, %rd19, %rd17;
ld.global.f32 %f18, [%rd21];
fma.rn.f32 %f19, %f1, %f17, %f18;
st.global.f32 [%rd21], %f19;
add.s64 %rd22, %rd20, %rd17;
ld.global.f32 %f20, [%rd22];
add.s64 %rd23, %rd21, %rd17;
ld.global.f32 %f21, [%rd23];
fma.rn.f32 %f22, %f1, %f20, %f21;
st.global.f32 [%rd23], %f22;
With __restrict__, we also get two loads followed by one store, so it seems the qualifier didn't have any effect?
mul.wide.s32 %rd14, %r3, 4;
add.s64 %rd15, %rd2, %rd14;
ld.global.nc.f32 %f11, [%rd15];
add.s64 %rd16, %rd1, %rd14;
ld.global.f32 %f12, [%rd16];
fma.rn.f32 %f13, %f1, %f11, %f12;
st.global.f32 [%rd16], %f13;
cvt.u64.u32 %rd17, %r2;
add.s64 %rd18, %rd15, %rd17;
ld.global.nc.f32 %f14, [%rd18];
add.s64 %rd19, %rd16, %rd17;
ld.global.f32 %f15, [%rd19];
fma.rn.f32 %f16, %f1, %f14, %f15;
st.global.f32 [%rd19], %f16;
add.s64 %rd20, %rd18, %rd17;
ld.global.nc.f32 %f17, [%rd20];
add.s64 %rd21, %rd19, %rd17;
ld.global.f32 %f18, [%rd21];
fma.rn.f32 %f19, %f1, %f17, %f18;
st.global.f32 [%rd21], %f19;
add.s64 %rd22, %rd20, %rd17;
ld.global.nc.f32 %f20, [%rd22];
add.s64 %rd23, %rd21, %rd17;
ld.global.f32 %f21, [%rd23];
fma.rn.f32 %f22, %f1, %f20, %f21;
st.global.f32 [%rd23], %f22;
However, looking closely, we do see that the assembly changed. Load instructions for the x array gained an additional .nc suffix.
This suffix indicates that these loads can be performed non-coherent. No other thread is allowed to write to this memory location,
so all the hardware mechanisms that are normally employed to ensure that the different threads can see a "consistent" view of the memory are disabled.
Interestingly, when compiling with nvcc 13.3 (the course material uses cuda 13.0), there is indeed a reordering of instructions:
ld.global.nc.f32 %f11, [%rd3];
ld.global.f32 %f12, [%rd4];
fma.rn.f32 %f13, %f11, %f1, %f12;
st.global.f32 [%rd4], %f13;
mul.wide.s32 %rd14, %r1, 4;
add.s64 %rd15, %rd3, %rd14;
ld.global.nc.f32 %f14, [%rd15];
add.s64 %rd16, %rd4, %rd14;
ld.global.f32 %f15, [%rd16];
fma.rn.f32 %f16, %f14, %f1, %f15;
ld.global.nc.f32 %f17, [%rd5]; // third load between two stores
st.global.f32 [%rd16], %f16;
ld.global.f32 %f18, [%rd6];
fma.rn.f32 %f19, %f17, %f1, %f18;
mul.wide.s32 %rd17, %r3, 4;
add.s64 %rd18, %rd3, %rd17;
ld.global.nc.f32 %f20, [%rd18];
add.s64 %rd19, %rd4, %rd17;
st.global.f32 [%rd6], %f19;
ld.global.f32 %f21, [%rd19]; // only one load between two stores
fma.rn.f32 %f22, %f20, %f1, %f21;
st.global.f32 [%rd19], %f22;
bra.uni $L__BB2_8;
2.5 Persistent Kernels
Even with the coarsening introduced above, we have not changed the fundamental decomposition of the problem: Each thread handles a fixed amount of work, and the total problem size is controlled by adjusting the number of blocks that get launched.
We can go further and make the amount of threads fixed, with each thread performing more work as the problem size grows. The number of threads is chosen just so that the maximum possible occupancy is reached, that is, so that the GPU handles exactly one wave of threads. As individual threads persist from beginning to end of the kernel run, these kernels are called persistent kernels.
One way of implementing persistent kernels is via a grid-strided loop: The out-of-bounds check becomes the loop condition, and each iteration increments the index by the total number of threads in the grid.
For the saxpy kernel, it looks like this:
__global__ void saxpy_persistent(int n, float alpha,
const float* __restrict__ x,
float* __restrict__ y) {
int stride = gridDim.x * blockDim.x;
int base = blockIdx.x * blockDim.x + threadIdx.x;
for (int base_i = base; base_i < n; base_i += stride * COARSE) {
#pragma unroll
for (int k = 0; k < COARSE; ++k) {
int i = base_i + k * stride;
if (i < n)
y[i] = alpha * x[i] + y[i];
}
}
}
We see another improvement over the existing previous version:
| Metric | V1 | V2 | V3 | V4 |
|---|---|---|---|---|
| Instructions | 19,922,944 | 14,680,064 | 11,796,480 | 12,398,848 |
| Compute util | 41.13% | 33.65% | 17.64% | 38.39% |
| DRAM BW | 36.15% | 45.55% | 47.84% | 55.38% |
| DRAM Read | 256.1 MiB | 256.0 MiB | 256.0 MiB | 256.0 MiB |
| DRAM Write | 89.4 MiB | 90.0 MiB | 89.5 MiB | 89.0 MiB |
| Cycles | 173,914 | 138,173 | 131,364 | 113,318 |
| Duration | 150.6 µs | 119.7 µs | 113.8 µs | 98.2 µs |
Generally, persistent kernels are useful when the setup or teardown code per block is expensive, such as if large constants have to be fetched or if there are reductions. On the flip side, a persistent kernel removes the ability of the GPU scheduler to do work balancing. If one block is slower for some reason (possibly even just being scheduled later because of work on a different stream), there is no way to use free resources on other SMs to rebalance the workload. As such, careful consideration should be given to whether the kernel is suitable for persistent execution.
2.6 Vectorized load and store operations
Even in the ideal case of infinite coarsening, only a tiny fraction of instructions would be useful arithmetic. For every FFMA, there
are at least two loads, one store, and two address calculations. Can we do better?
This is where vectorized load and store operations come in. With those, a single thread can load up to 16 bytes (up to 32 on Blackwell) with a single instruction. Correspondingly, the full warp reads or writes a total of 512 bytes.
CUDA provides dedicated data types for vectors of up to four elements comprised of built-in types. These are declared in vector_types.h,
and follow the naming convention of type name concatenated with number of elements, that is, int2 would be a vector of two ints, and
float4 a vector of four floats. The individual elements can be accessed via .x, .y, .z, and .w.
Applying this to the earlier kernel version 3, we have to replace the loop with explicit access to the individual vector elements:
__global__ void saxpy_v5(int n, float alpha,
const float4* __restrict__ x, float4* __restrict__ y)
{
// Base index for this thread; stride by blockDim.x so consecutive threads
// always access consecutive addresses (coalescing preserved).
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
float4 a = x[i];
float4 b = y[i];
b.x = alpha * a.x + b.x;
b.y = alpha * a.y + b.y;
b.z = alpha * a.z + b.z;
b.w = alpha * a.w + b.w;
y[i] = b;
}
}
Importantly, this changes the thread-to-element mapping yet again, back to the natural ordering. Even though the stride between threads is now 16 bytes, as each thread reads the entire vector, this is still a coalesced access.
If we peek at the assembly, we see that the compiler now generates a single ld.global.v4.f32 instruction for the entire vector:
.visible .entry saxpy_v5(
.param .u32 saxpy_v5_param_0,
.param .f32 saxpy_v5_param_1,
.param .u64 saxpy_v5_param_2,
.param .u64 saxpy_v5_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<22>;
.reg .b32 %r<6>;
.reg .b64 %rd<8>;
// __global__ void saxpy_v5(int n, float alpha,
ld.param.u32 %r2, [saxpy_v5_param_0];
ld.param.f32 %f1, [saxpy_v5_param_1];
ld.param.u64 %rd1, [saxpy_v5_param_2];
ld.param.u64 %rd2, [saxpy_v5_param_3];
// // Base index for this thread; stride by blockDim.x so consecutive threads
// // always access consecutive addresses (coalescing preserved).
// int i = blockIdx.x * blockDim.x + threadIdx.x;
mov.u32 %r3, %ctaid.x;
mov.u32 %r4, %ntid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
// if (i < n) {
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2;
// int i = blockIdx.x * blockDim.x + threadIdx.x;
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
// float4 a = x[i];
mul.wide.s32 %rd5, %r1, 16;
add.s64 %rd6, %rd4, %rd5;
ld.global.nc.v4.f32 {%f2, %f3, %f4, %f5}, [%rd6];
// float4 b = y[i];
add.s64 %rd7, %rd3, %rd5;
ld.global.v4.f32 {%f10, %f11, %f12, %f13}, [%rd7];
// b.y = alpha * a.y + b.y;
// b.z = alpha * a.z + b.z;
// b.w = alpha * a.w + b.w;
fma.rn.f32 %f18, %f5, %f1, %f13;
// b.z = alpha * a.z + b.z;
fma.rn.f32 %f19, %f4, %f1, %f12;
// b.x = alpha * a.x + b.x;
// b.y = alpha * a.y + b.y;
fma.rn.f32 %f20, %f3, %f1, %f11;
// b.x = alpha * a.x + b.x;
fma.rn.f32 %f21, %f2, %f1, %f10;
// y[i] = b;
st.global.v4.f32 [%rd7], {%f21, %f20, %f19, %f18};
$L__BB0_2:
// }
// }
ret;
}
There is one important condition for using vectorized load and store operations:
The source/target address must be aligned to the size of the vector type, i.e. float4 requires 16-byte alignment. Otherwise, the attempt will lead to a runtime crash.
Fortunately, cudaMalloc guarantees at least 256-byte alignment, and in many practical cases with regular data structures, alignment can be typically assumed. Still, it is at least advisable to put an alignment check into the kernel launcher, so that the asynchronous GPU crash can be turned into a proper program error.
With this improvement, we are finally getting almost 80% of the peak performance:
| Metric | V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|---|
| Instructions | 19,922,944 | 14,680,064 | 11,796,480 | 12,398,848 | 5,767,168 |
| Compute util | 41.13% | 33.65% | 17.06% | 38.39% | 22.60% |
| DRAM BW | 36.15% | 45.55% | 46.16% | 55.38% | 79.06% |
| DRAM Read | 256.1 MiB | 256.0 MiB | 256.0 MiB | 256.0 MiB | 256.0 MiB |
| DRAM Write | 89.4 MiB | 90.0 MiB | 89.5 MiB | 89.0 MiB | 86.0 MiB |
| Cycles | 173,914 | 138,173 | 136,177 | 113,318 | 78,565 |
| Duration | 150.6 µs | 119.7 µs | 117.9 µs | 98.2 µs | 68.1 µs |
Of course, vectorized loads could also be combined with further coarsening, so that multiple vectors are handled by each thread. This is left as an exercise to the reader and can be explored as part of the
bfloat16 axpy kernel below.
📊 Instructions
Not only do we reduce the number of LDG and STG by a factor of four, but we also only need one fourth of the address calculations, so IMAD
is also reduced.
| Opcode | Count | Visualization |
|---|---|---|
| FFMA | 1,048,576 | |
| LDC | 1,048,576 | |
| LDCU | 786,432 | |
| IMAD | 786,432 | |
| LDG | 524,288 | |
| EXIT | 524,288 | |
| S2R | 262,144 | |
| ISETP | 262,144 | |
| S2UR | 262,144 | |
| [OTHER] | 262,144 |
Note: In some situations, the compiler is able to automatically vectorize memory accesses if it can be certain that alignment requirements are met,
such as when accessing static shared memory or when using hints like __builtin_assume_aligned. As we have seen in the examples, however,
vectorized memory access typically goes together with a change in access pattern, which is not something the compiler can do automatically.
Under rare circumstances it is also possible that the compiler uses scalar memory loads even though the code is written in vectorized form.
Note: Starting with Blackwell, the GPU supports 256-bit vector loads and stores for global memory (but not for shared memory). There is no nice C interface for it, you'll need to write PTX.
2.7 Exercise — AXPY with bfloat16
Implement "alpha times x plus y" for brainfloat16 inputs and outputs.
In this warmup exercise, implement an axpy kernel with the bfloat16 type instead of standard
single precision. As this data type is only 16 bits wide, vectorized memory access is especially important.
Interface
The function to be implemented is:
void haxpy(int n, float alpha, const nv_bfloat16* x, nv_bfloat16* y);
Here, n is the number of elements in x and y, and alpha is a finite scalar.
The number of elements can be an arbitrary positive integer, and the starting addresses of the
device-memory arrays x and y are only guaranteed to align to the requirement of nv_bfloat16,
that is, to two bytes. If you use vectorized loads, you need to handle alignment yourself. However, you
may assume that both arrays are offset by the same amount, i.e., ((uintptr_t)x % 256 == (uintptr_t)y % 256).
Background: bfloat16
bfloat16 (BF16) is one of two standard 16-bit floating-point formats. It uses the same 8-bit exponent as float32 (same dynamic range) but only 7 mantissa bits instead of 23 (lower precision).
The bfloat16 data type has been available since the Ampere architecture (A100, RTX 30xx), and is provided under the name __nv_bfloat16 in cuda_bf16.h.
By now, it has been established as the de-facto standard for deep learning; it is half as big as float32, enjoys native tensor core support and,
contrary to half (FP16), its dynamic range means it does not require complicated loss scaling techniques to function properly.
Hints
※ Vectorization and BF16
The only vector type available in CUDA is __nv_bfloat162, which is just two BF16 values for a total of 4 bytes.
To use larger vectorized loads, we have three options:
1) Define our own bf16x8 type that is 16-byte aligned and has 8 elements. We can just follow the pattern with which
the cuda headers define their vector types.
2) Use reinterpret_cast to handle all the data transfers in a natively supported vector type.
3) Use __builtin_assume_aligned to signal the compiler that alignment can be assumed, and hope that it transforms an
unrolled loop into a vectorized access.
Custom type
struct __align__(16) bf16x8 {
__nv_bfloat162 lo2; // elements 0–1
__nv_bfloat162 ml2; // elements 2–3
__nv_bfloat162 mh2; // elements 4–5
__nv_bfloat162 hi2; // elements 6–7
};
Casting
nv_bfloat16 dst[16];
reinterpret_cast<int4*>(dst) = *reinterpret_cast<const int4*>(src);
You can test your solution locally:
python run.py test