Part 4 — Asynchronous Data Movement: LDGSTS

4.1 The Problem with Synchronous Loads

Every kernel you have written so far uses synchronous global memory loads: the thread issues a load instruction, which instructs the LSU to fetch data from global memory and place it in the destination register. If the data is not already available in caches, hundreds of cycles may pass before the result is available from the register file.

This is not a problem in itself, because the GPU has two ways of hiding this latency. First, with sufficient occupancy, the warp scheduler can simply switch to another, ready warp, executing instructions from a different program stream until the load completes. Second, even though GPUs cannot perform out-of-order execution, a load instruction in itself does not stall the warp. It is only when the register into which the load places its result is needed by another instruction that execution needs to halt and wait for the load to complete.

This works until the number of outstanding requests exceeds the capacity of the GPU to track, in which case it needs to LG Throttle, that is, it actually stalls at the load instruction itself. By applying the vectorization techniques from Part 2, we can reduce the number of slots required, avoiding LG Throttle warp stalls. However, there is a second resource that gets consumed by each outstanding load: While the load is still in flight, the target register cannot be used for any other purpose. Hence, for kernels that require a large amount of registers for their actual computation, occupancy suffers and the amount of parallel loads in flight is limited.


4.2 Asynchronous Copy to Shared Memory

The fundamental problem is that we need a dedicated area of memory into which the LSU can place the loaded value. Two such areas exist in each SM: the register file, and shared memory. Starting with the Ampere architecture, CUDA provides dedicated instructions for moving data from global to shared memory directly. cp.async is a PTX instruction that initiates a direct global→shared memory copy, exposed to CUDA C++ via

// Copy 16 bytes from global src to shared dst, no register involvement
__pipeline_memcpy_async(dst_smem_ptr, src_global_ptr, 16);

Loads can be issued with granularity 4, 8, and 16 bytes, and require natural alignment.

Each thread needs its own address in shared memory

In regular load instructions, SIMT implies that each thread in a warp gets handed physically different target memory, even if they all store to the same variable/named register. In shared memory, you are responsible to ensure every thread gets a distinct address.

Async loads require manual dependency tracking

When working with registers, it is the assembler/hardware's job to ensure that accesses to the same register don't race, by keeping track of the corresponding dependencies and stalling a warp trying to access a register that is not ready. In contrast, when working with shared memory, it is the programmer's job to ensure proper synchronization.

Consequently, asynchronous copies come with two additional functions: __pipeline_commit() and __pipeline_wait_prior(). Commit is used to signal that a batch of data transfers have been issued, and __pipeline_wait_prior() is used to wait for all commited operations to complete.

A simple use of async copy would look like this:

Async copySync copy
int tid = threadIdx.x;
__pipeline_memcpy_async(
   dst_smem_ptr + tid, 
   src_global_ptr + tid, 16);
__pipeline_commit();

// do some work that does 
// not depend on the copy

__pipeline_wait_prior(0);   
// warp stalls until copy is done

float4 reg = dst_smem_ptr[tid];
// warp stalls until smem->rmem is done
f(reg);
int tid = threadIdx.x;
float4 reg = src_global_ptr[tid];



// do some work that does 
// not depend on the copy
// 4 registers are unavailable





// warp stalls until gmem->rmem is done
f(reg);                                

When used like this, the advantage of cp.async is mostly limited to the reduction in register pressure. Stronger benefits arise once we start implementing software pipelining.


4.3 Software Pipelining

In the example above, the latency of memory transfers is only hidden successfully if the computations in the // do some work that does not depend on the copy part of the code take sufficiently long. But what if the kernel does not have any work to be done at that time? We can still overlap transfers and computations by overlapping the load of memory region N + 1 with computation of memory region N.

This is essentially what already happens if two blocks gets scheduled to the same SM: While block Ns warps are waiting for memory, block N+1 can already issue its load instructions. Then, when block N is running its computations, the data transfers for block N+1 keep happening in the background, and ideally, by the time block N finishes and the warp scheduler looks for work in block N+1, it has its data ready and can commence its computations. Block N exits, N+2 gets scheduled on the SM, and ideally, while hiding latencies during execution of N+1, the warp scheduler issues enough instructions to N+2 so that its loads are in flight now.

The scenario above works, but it relies on having sufficiently many warps and on getting lucky in the scheduling. With software pipelining, we move the same process into a single warp, guaranteeing overlap. In order to do so, we must ensure that each warp has sufficient work (thread coarsening); here, we will use a grid-stride loop example. A naive grid-stride loop looks as follows:

for(int idx = tid; idx < N; idx += THREADS_IN_GRID) {
    out[idx] =  f(in[idx]);
}

We can introduce a single level of pipelining as follows:

int idx = tid;
float next;
if(idx < N) {
    next = in[idx];
}
for(int idx = tid; idx < N; idx += G) {
    float tmp = next;
    if(idx + G < N) 
        next = in[idx + G];
    out[idx] =  f(tmp);
}

This is using regular, synchronous loads by employing a scratch register next, which can serve as a target for loading while f is computing.

As a first step, can you rewrite this example using __pipeline_memcpy_async?

✦ Solution
int idx = tid;
__shared__ float next[BLOCK];
if(idx < N) {
    __pipeline_memcpy_async(next + tid, in + idx, 4);
    __pipeline_commit();
}

for(int idx = tid; idx < N; idx += G) {
    __pipeline_wait_prior(0);
    float tmp = next[tid];
    if(idx + G < N) {
        __pipeline_memcpy_async(next + tid, in + idx + G, 4);
        __pipeline_commit();
    }
    out[idx] =  f(tmp);
}

Both examples have 4 × THREADS_IN_GRID many bytes in flight at any given time. One way to increase that number would be to load more numbers in one go. We could emit __pipeline_memcpy_async(next + 4 * tid, in + 4 * idx, 16);. But then, we also need to increase the size of tmp to float4. We must move the requested data from shared memory to registers before we can reuse shared memory as a destination.

If we want to avoid using more registers, we can instead have multiple buffers in shared memory. We can issue loads for one shared memory region, while simultaneously reading data from the other.

int idx = tid;
int stage = 0;
__shared__ float next[2 * BLOCK];
if(idx < N) {
    __pipeline_memcpy_async(next + tid, in + idx, 4);
    __pipeline_commit();
}

for(int idx = tid; idx < N; idx += G) {
    __pipeline_wait_prior(0);
    if(idx + G < N) {
        int s = (stage + 1) % 2;
        __pipeline_memcpy_async(next + tid + BLOCK * s, in + idx + G, 4);
        __pipeline_commit();
    }
    out[idx] =  f(next[tid + stage * BLOCK]);
    stage = (stage + 1) % 2;
}

We use the stage variable to indicate which buffer contains the data that is currently being read, and use the complementary stage to save the next batch of inputs. Note that this transformation leaves a slice of time without bytes in flight: We call __pipeline_wait_prior before any new __pipeline_memcpy_async to ensure that we don't end up waiting for the new transfers, too. This is slightly awkward, and would make building pipelines with more stages quite difficult.

Fortunately, there is one more feature in the synchronization primitives that is designed to handle exactly this situation: We can call __pipeline_wait_prior(N), with an integer argument, to indicate that we only need to wait for all transfers except the last N to be completed. With this, the main loop simplifies to

for(int idx = tid; idx < N; idx += G) {
    if(idx + G < N) {
        int s = (stage + 1) % 2;
        __pipeline_memcpy_async(next + tid + BLOCK * s, in + idx + G, 4);
        __pipeline_commit();
    }
    __pipeline_wait_prior(1);
    out[idx] =  f(next[tid + stage * BLOCK]);
    stage = (stage + 1) % 2;
}

and we ensure that new loads are already enqueued before we do any blocking synchronization. And this pattern generalizes directly to more than two stages.


4.4 Putting It Together — Pipelined SAXPY

Let's look at this in a real kernel. Taking the saxpy kernel from chapter two, we can rewrite it using async copies as

__global__ __launch_bounds__(256, 4) void saxpy_v6(int n, float alpha,
                         const float4* __restrict__ x,
                               float4* __restrict__ y)
{
    constexpr int BLOCK = 256;
    constexpr int STEP  = 3;
    constexpr int TILE  = BLOCK * STEP;

    __shared__ float4 sx[2][TILE];
    __shared__ float4 sy[2][TILE];

    const int tid       = threadIdx.x;
    const int grid_tile = gridDim.x * TILE;

    float4* my_sx = &sx[0][tid];
    float4* my_sy = &sy[0][tid];

    int base = blockIdx.x * TILE;

    for (int s = 0; s < STEP; ++s) {
        const int gi = base + s * BLOCK + tid;
        if (gi < n) {
            __pipeline_memcpy_async(my_sx + s * BLOCK, &x[gi], sizeof(float4));
            __pipeline_memcpy_async(my_sy + s * BLOCK, &y[gi], sizeof(float4));
        }
    }
    __pipeline_commit();

    int cur = 0;

    #pragma unroll 2
    for (; base < n; base += grid_tile) {
        const int next      = cur ^ 1;
        const int next_base = base + grid_tile;

        for (int s = 0; s < STEP; ++s) {
            const int gi = next_base + s * BLOCK + tid;
            if (gi < n) {
                __pipeline_memcpy_async(my_sx + next * TILE + s * BLOCK, &x[gi], sizeof(float4));
                __pipeline_memcpy_async(my_sy + next * TILE + s * BLOCK, &y[gi], sizeof(float4));
            }
        }
        __pipeline_commit();

        __pipeline_wait_prior(1);

        for (int s = 0; s < STEP; ++s) {
            const int gi = base + s * BLOCK + tid;
            if (gi < n) {
                const float4 vx = my_sx[cur * TILE + s * BLOCK];
                const float4 vy = my_sy[cur * TILE + s * BLOCK];
                y[gi] = { alpha * vx.x + vy.x,
                          alpha * vx.y + vy.y,
                          alpha * vx.z + vy.z,
                          alpha * vx.w + vy.w };
            }
        }

        cur = next;
    }
}

In this kernel, we use a 2-stage pipeline and 48 bytes per thread per pipeline stage, for each of the two input arrays x and y. This leads to a total shared memory consumption of 256 × 48 × 2 × 2 = 49152 bytes, or 48 KiB, the maximum amount of allowed shared memory per block. If we wanted to use more, we would have to switch to dynamic shared memory and set cudaFuncAttributeMaxDynamicSharedMemorySize with cudaFuncSetAttribute.

You may also notice two micro-optimizations made in this code: First, each thread pre-computes a base pointer for its own shared memory region, avoiding the need for additional indexing arithmetic in the hot loop. Second, the loop is annotated with #pragma unroll. As this loop does not have a simple structure, the compiler is unlikely to do this on its own, yet by unrolling twice we turn the cur stage variable into something that can be resolved at compile time.

Metric V5 V6
Instructions 5,767,168 6,088,856
Compute util 22.60% 12.96%
DRAM BW 79.06% 70.71%
DRAM Read 256.0 MiB 256.0 MiB
DRAM Write 86.0 MiB 83.4 MiB
Cycles 78,565 87,243
Duration 68.1 µs 75.6 µs
Loads 524,288 0
Stores 262,144 262,144
LDGSTS 0 552,720
📊 Profiling details

Looking first at the number of instructions of different type that get executed, we see a surprisingly large number of LDS shared loads. The code using vectorized float4 load instruction fetching 16 bytes each, the same amount as LDGSTS deposits in shared memory per instruction. Consequently, we would expect to see the same number of LDS and `LDGSTS instructions.

[ncu_instruction_chart error: No JSON cache for `examples/part4/reports/saxpy-v6-b4000.ncu-rep` and unable to generate one (report present: False)]

Where are these additional LDS instructions coming from? Checking the generated SASS shows the following pattern:

@!PT  LDS RZ, [RZ]
@!PT  LDS RZ, [RZ]
@!PT  LDS RZ, [RZ]
@!P0  LDGSTS.E.BYPASS.128 desc[UR10][R4.64], [R33+0x3000]

Before the LDGSTS operation, there are three dummy LDS instructions, reading from address 0 -- RZ is the zero register -- and seemingly writing it to RZ. Except that no actual operation takes place, because these three instructions are predicated with !PT, the negation of the true predicate.

However, despite SASS generally being described as the low-level GPU assembly language, it is only a lossy representation of the true machine code (see, .e.g, chapter 2 here). In addition to the visible SASS instruction itself, the machine code also includes control data. For example, if an instruction has a latency of four cycles, and the following instruction requires its result as an input, then the control data might inform the GPU that the warp will be ineligible for execution until the latency has elapsed, thus avoiding any data hazards.

Presumably, these LDS instructions carry some control information, but as these details are not made public by NVidia, a definitive answer would require careful reverse engineering. (Here is the question as asked on NVidia's forums.)


AdvancedL1 Bypass

The cp.async instruction underlying __pipeline_memcpy_async provides a mechanism for controlling the behaviour of loads with respect to the L1 cache: It can either fetch the value into L1 and then to shared memory, or bypass the L1 cache altogether. When would you use one over the other? If the fetch covers entire cache lines, then there is generally no benefit to caching in L1 and the bypass route is both more efficient and avoids polluting the L1 cache. On the other hand, if only a partial cache line is fetched, and the rest of the cache line will be used later, then caching in L1 speeds up the subsequent async load.

__pipeline_memcpy_async does not expose the bypass configuration to the user, but it employs a heuristic informed by the considerations above: If the fetch size is 16 bytes per thread, then the bypass is used, otherwise the fetch is cached in L1.

Unfortunately, there is a pathological case in which this strategy can be strongly suboptimal. Just because an instruction requests an amount of data filling a full cache line does not mean that this request will be serviced with a single transfer. In particular, with size 16, the requirements of __pipeline_memcpy_async are only that its pointers be 16-byte aligned. The cache, in contrast, is organized into 32-byte sized sectors, and transfers happen at the granularity of sectors. If a request gets split into two transfers, we end up fetching partial data from one cache line from L2, but as there is no caching in L1, that information is discarded when writing to shared memory, and the same data needs to be requested from L2 a second time. This is visible in NCU by showing larger L1-bypass transfer sizes than actually requested by the kernel.

One fix would be to switch from pipeline_memcpy_async to inline ptx and manually specify the cp.async.ca modifier. However, if possible, a better fix is to ensure proper alignment.

4.6 Exercise — Max pooling

Implement 1D max pooling with radius 15 (a 31-element sliding window) over a float32 array.

Interface

The function to be implemented is:

void maxpool15(float* out, const float* in, int n);

Here, n is the number of elements in in and out. For each output index i,

out[i] = max over j in [i-15, i+15] of in[j]

where indices j outside [0, n) are treated as -inf (i.e. they never contribute to the maximum). Equivalently, the window is clamped to the valid range [max(0, i-15), min(n-1, i+15)].

n can be an arbitrary positive 32-bit integer, 0 < n < 2^31. The starting addresses of in and out are guaranteed to align to 256 bytes, and the input values in are positive single-precision floating-point numbers.

You can test your solution locally:

python run.py test

From here on, we require at least SM_90 / Hopper