
My code reads data from global memory and performs some calculations whose result, at the end, needs to be normalized by the sum of all the data in that particular block. I am using techniques from Mark Harris's "Optimizing Parallel Reduction in CUDA" to do the sum, and that works significantly faster than summing via a loop (in my original code, a gain of almost 5x). However, once the sum is done, __syncthreads() is stalling the warps quite a bit and has become the bottleneck of my code.

My SM and memory usage are at ~39% and 8% respectively, which I'm assuming means my code is latency-bound. My theoretical warps per scheduler is 8 and active warps is 7.71, but eligible warps is at only 0.24. My "No Eligible [%]" is 77%. I'm assuming the stall wait and stall barrier, which are 13.59 and 13.04 cycles per instruction respectively, are responsible for the low warp eligibility. I've included sampling data values for the stalled lines (I'm particularly concerned about stall 1 as it's 29%) in the code comments below.

My actual code is a little more complicated, mainly due to a larger number of variables to be read / calculated and different equations, but my kernel is pretty much what I have included below.

#include <stdio.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
   if (code != cudaSuccess) {
      fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

#define length_1 256
#define length_2 100
#define length_3 250

__device__ float * d;

__global__ void kernel() {

    int tid = threadIdx.x;

    __shared__ float d_for_current_block[512];
    d_for_current_block[tid] = d[blockIdx.x * blockDim.x + threadIdx.x]; // Move to shared memory because originally d isn't coalesced.

    for (int k = 0; k < length_2; k++) {
        __shared__ float power_sums[512];
        __shared__ float powers[512];

        float power = (2 * d_for_current_block[tid] - 1.0) / 1.0;   // Stall 5; sampling data [all] = 1,111
        powers[tid] = power;
        power_sums[tid] = power * power;

        // Sum powers of all threads in current block.

        __syncthreads();

        if (tid < 128)                                      // Stall 4; sampling data [all] = 1,159
            power_sums[tid] += power_sums[tid + 128];

        __syncthreads();

        if (tid < 64)                                       // Stall 3; sampling data [all] = 2,213
            power_sums[tid] += power_sums[tid + 64];

        __syncthreads();

        if (tid < 32) {                                     // Stall 2; sampling data [all] = 3,100
            power_sums[tid] += power_sums[tid + 32];
            power_sums[tid] += power_sums[tid + 16];
            power_sums[tid] += power_sums[tid + 8];
            power_sums[tid] += power_sums[tid + 4];
            power_sums[tid] += power_sums[tid + 2];
            power_sums[tid] += power_sums[tid + 1];
        }

        __syncthreads();

        powers[tid] = powers[tid] / power_sums[0];        // Stall 1; sampling data [all] = 17,016
    }
}

int main(){

    float * h, * d_ptr;
    h = (float *) malloc(length_1 * length_2 * length_3 * sizeof(float));

    for (int k = 0; k < length_3; k++)
        for (int j = 0; j < length_1; j++)
            for (int i = 0; i < length_2; i++)
                h[i] = 100 * (float) rand() / (float)(RAND_MAX);

    gpuErrchk( cudaMalloc((void**) &d_ptr, length_1 * length_2 * length_3 * sizeof(float)) );
    gpuErrchk( cudaMemcpy(d_ptr, h, length_1 * length_2 * length_3 * sizeof(float), cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpyToSymbol(d, &d_ptr, sizeof(float*)) );

    kernel<<<length_3, length_1>>>();

    gpuErrchk( cudaFree(d_ptr) );
    free( h );

    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

}

My question is: is there a way to reduce the latency that the barrier wait is creating? I understand where it's coming from, but I was wondering if there is a way to restructure the code, or something else that I'm not seeing, that could decrease the latency. I've played around with changing the block size and it doesn't make that much of a difference, which makes sense since, in any case, all the threads need to be done before any warp can proceed further.

From my understanding, one way to hide latency is to give warps other computations to do while they wait, but I do not have any other calculation that needs to be done other than getting the normalization value (power_sums[0]) and using it to normalize all my powers. Is there something else that I can do, or is this pretty much what I can expect for this particular problem?

EDIT:

Updated code that addresses Robert Crovella's comments below. New numbers: SM Usage = 39%; Memory Usage = 67%; Active Warps = 7.95; Eligible warps per scheduler = 0.25; No Eligible [%] = 76.04%.

In my original full code, SM usage and memory usage are 49% and 5% respectively. Other numbers are similar.

#include <stdio.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
   if (code != cudaSuccess) {
      fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

#define length_1 256
#define length_2 100
#define length_3 250

__device__ float * d;

__global__ void kernel(float * d_all_sums) {

    int tid = threadIdx.x;
    float final = 0.0;

    for (int i = 0; i < length_2; i++) {
        __shared__ float power_sums[512];
        __shared__ float powers[512];

        int index = blockIdx.x * blockDim.x * length_2 + threadIdx.x * length_2 + i;

        float po = d[index];                   // Stall 2; sampling data [all] = 4,114

        float power = (2 * po - 1.0) / 2.0;   // Stall 5; sampling data [all] = 1,111
        powers[tid] = power;
        power_sums[tid] = power * power;

        // Sum powers of all threads in current block.

        __syncthreads();

        if (tid < 128)                                      // Stall 5; sampling data [all] = 1,172
            power_sums[tid] += power_sums[tid + 128];

        __syncthreads();

        if (tid < 64)                                       // Stall 4; sampling data [all] = 2,037
            power_sums[tid] += power_sums[tid + 64];

        __syncthreads();

        if (tid < 32) {                                     // Stall 3; sampling data [all] = 2,843
            power_sums[tid] += power_sums[tid + 32];
            power_sums[tid] += power_sums[tid + 16];
            power_sums[tid] += power_sums[tid + 8];
            power_sums[tid] += power_sums[tid + 4];
            power_sums[tid] += power_sums[tid + 2];
            power_sums[tid] += power_sums[tid + 1];
        }

        __syncthreads();

        final  += powers[tid] / power_sums[0];        // Stall 1; sampling data [all] = 15,775
    }

    if (tid == 0)
        d_all_sums[blockIdx.x] = final;

}

int main(){

    float * h, * h_all_sums;
    float * d_ptr, * d_all_sums;
    h = (float *) malloc(length_1 * length_2 * length_3 * sizeof(float));
    h_all_sums = (float *) malloc(length_3 * sizeof(float));
    memset(h_all_sums, 0, length_3 * sizeof(float));

    for (int k = 0; k < length_3; k++)
        for (int j = 0; j < length_1; j++)
            for (int i = 0; i < length_2; i++)
                h[k * length_1 * length_2 + j * length_2 + i] = 100 * (float) rand() / (float)(RAND_MAX);

    gpuErrchk( cudaMalloc((void**) &d_ptr, length_1 * length_2 * length_3 * sizeof(float)) );
    gpuErrchk( cudaMemcpy(d_ptr, h, length_1 * length_2 * length_3 * sizeof(float), cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpyToSymbol(d, &d_ptr, sizeof(float*)) );

    gpuErrchk( cudaMalloc( (void**) &d_all_sums, length_3 * sizeof(float) ) );

    kernel<<<length_3, length_1>>>(d_all_sums);

    gpuErrchk( cudaMemcpy(h_all_sums, d_all_sums, length_3 * sizeof(float), cudaMemcpyDeviceToHost) );

    FILE *f = fopen("test_output.txt", "w");
    if (f != NULL) {
        for (int k = 0; k < length_3; k++)
            fprintf(f, "k = %d; power = %f.\n", k, h_all_sums[k]);
        fclose( f );
    }

    gpuErrchk( cudaFree(d_ptr) );
    gpuErrchk( cudaFree(d_all_sums) );
    free( h );
    free( h_all_sums );

    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

}
tripatheea
  • The code you have inside this section: `if (tid < 32) { ... }` is no longer considered proper/legal coding, which you can discover by running your code with the `--tool racecheck` option for either `cuda-memcheck` or `compute-sanitizer`. From a performance perspective, you could switch to a [warp-shuffle parallel reduction](https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/), which may be faster, but if you retain your current shared-memory sweep approach, the usage of `__syncthreads()` and its effects is unavoidable. – Robert Crovella Jul 23 '21 at 22:52
  • The code you have here may not be the best for benchmarking because the kernel does not modify any global state. – Robert Crovella Jul 23 '21 at 22:58
  • Your loop in `k` also computes the same values for `powers` at each pass of the loop, so it isn't very useful for comparison. The reason I'm pointing these things out is because to offer a suggestion for something that might be better, I need something sensible to compare to. What you have here isn't. – Robert Crovella Jul 23 '21 at 23:05
  • I'm sorry about that! I've updated the code to have power that depends on k and also writes back to global memory (and eventually writes to a file). I'll look at warp-shuffle parallel reduction and the CUB library mentioned in the link you gave and give them a shot. – tripatheea Jul 24 '21 at 01:09
  • The warp-shuffle method doesn't appear to be any faster according to my testing on V100. – Robert Crovella Jul 24 '21 at 02:08
  • My test (in the real full code) didn't show much difference either. However, the shared memory approach has lower register usage (42 vs 34), fewer cycles (141M vs 71M), and a larger arithmetic intensity and performance in the roofline. If I want to stay with the shared memory approach, what's a way that's now considered proper for the `if (tid < 32) { ... }` part? – tripatheea Jul 24 '21 at 06:30
  • Add a `__syncwarp()` on each line, like so: `power_sums[tid] += power_sums[tid + 32]; __syncwarp();` The `__syncwarp()` on the last line: `power_sums[tid] += power_sums[tid + 1];` is unnecessary, because immediately after that you have a `__syncthreads()` (which *is* necessary). – Robert Crovella Jul 24 '21 at 16:04
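
Putting the suggestions from these comments together, here is a minimal sketch of the two warp-level variants, assuming the 256-thread block and the power_sums shared array from the question (only the final-warp tail of the reduction is shown; everything before it stays as in the question):

// Variant A: keep the shared-memory sweep, separating each
// read-modify-write step in the final warp with __syncwarp().
if (tid < 32) {
    power_sums[tid] += power_sums[tid + 32]; __syncwarp();
    power_sums[tid] += power_sums[tid + 16]; __syncwarp();
    power_sums[tid] += power_sums[tid + 8];  __syncwarp();
    power_sums[tid] += power_sums[tid + 4];  __syncwarp();
    power_sums[tid] += power_sums[tid + 2];  __syncwarp();
    power_sums[tid] += power_sums[tid + 1];  // no __syncwarp() needed: a __syncthreads() follows
}
__syncthreads();

// Variant B: finish with a warp-shuffle reduction, using registers
// instead of shared memory for the final warp.
if (tid < 32) {
    float v = power_sums[tid] + power_sums[tid + 32];
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);   // lane 0 ends up with the warp total
    if (tid == 0)
        power_sums[0] = v;
}
__syncthreads();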

1 Answer


In addition to the useful comments, one efficient way to reduce the number of synchronizations is to perform an N-way reduction with N > 2 rather than a binary reduction. A 4-way reduction should be significantly faster, as it enables you to remove one __syncthreads() (it also increases arithmetic pressure, but that should not be a problem for this part of the code on most GPUs).
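
For illustration only, here is one possible shape of such a 4-way sweep for the 256-thread block in the question (a sketch, reusing the question's tid and power_sums names; it needs one __syncthreads() fewer than the binary tree):

// power_sums[tid] holds this thread's value; the __syncthreads() that
// follows the shared-memory stores has already been executed.

if (tid < 64)                       // 256 -> 64 partial sums in one pass
    power_sums[tid] += power_sums[tid +  64]
                     + power_sums[tid + 128]
                     + power_sums[tid + 192];
__syncthreads();

if (tid < 16)                       // 64 -> 16; every later pass stays inside warp 0
    power_sums[tid] += power_sums[tid + 16]
                     + power_sums[tid + 32]
                     + power_sums[tid + 48];
__syncwarp();

if (tid < 4)                        // 16 -> 4
    power_sums[tid] += power_sums[tid +  4]
                     + power_sums[tid +  8]
                     + power_sums[tid + 12];
__syncwarp();

if (tid == 0)                       // 4 -> 1
    power_sums[0] += power_sums[1] + power_sums[2] + power_sums[3];
__syncthreads();                    // power_sums[0] is now visible to the whole block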

Additionally, you do not need to update final just after a reduction. The reduced value can be temporarily saved in shared memory (in another array) and the reduced values read back later (doing all the updates after the loop). This enables you to remove another __syncthreads() from the i-based loop.

Moreover, the computation is very cheap compared to the cost of the reduction, and writing to power_sums only to read it back right after is not very efficient either. Shared memory reads/writes are relatively slow compared to basic arithmetic instructions. You can simply assign 2x more work to each thread and perform the first reduction step on the fly: each thread can compute 2 different power_sums values at a time and sum them in registers (with a 2x smaller block size). If the computation is more expensive in practice, then there is a trade-off to find between the benefit of having more threads per block and the reduction overhead. Based on my past experience, it is often better to use small blocks in such a case.
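
As a sketch of that idea (assuming blockDim.x is halved to 128; the base indexing below is purely illustrative), the first addition happens in registers before the shared-memory sweep starts:

// blockDim.x == 128: each thread covers two of the 256 values per block.
int base = blockIdx.x * 256 + tid;                // illustrative indexing
float p0 = (2.0f * d[base]       - 1.0f) / 2.0f;
float p1 = (2.0f * d[base + 128] - 1.0f) / 2.0f;

powers[tid]       = p0;                           // both powers are still needed
powers[tid + 128] = p1;                           // for the final normalization
power_sums[tid]   = p0 * p0 + p1 * p1;            // first reduction step done in registers
__syncthreads();

// The shared-memory sweep now starts from 128 partial sums instead of 256,
// i.e. one level (and its __syncthreads()) disappears.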

It is worth noting that relatively new GPUs can perform atomicAdd operations very efficiently per block (thanks to a clever hardware-based reduction). Atomics enable you to remove (almost?) all the __syncthreads() calls. Note that the hardware may still use some implicit synchronization depending on the target GPU. The resulting code should also be simpler and easier to read. However, the order of atomic instructions is not deterministic: it can change from one GPU to another, or possibly between two executions of the kernel on the same GPU. This must be taken into account if you care about the numerical stability of the kernel.
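
A minimal sketch of the atomic approach applied to the updated kernel in the question (block_sums is an illustrative name; using one shared accumulator per loop iteration avoids any re-zeroing race and leaves a single __syncthreads() per iteration):

__shared__ float block_sums[length_2];            // one accumulator per i-iteration
for (int j = tid; j < length_2; j += blockDim.x)
    block_sums[j] = 0.0f;
__syncthreads();

float final = 0.0f;
for (int i = 0; i < length_2; i++) {
    int index   = blockIdx.x * blockDim.x * length_2 + threadIdx.x * length_2 + i;
    float power = (2.0f * d[index] - 1.0f) / 2.0f;

    atomicAdd(&block_sums[i], power * power);     // block-wide sum via shared-memory atomics

    __syncthreads();                              // all contributions for iteration i are in
    final += power / block_sums[i];
}

Whether this beats the explicit tree depends on the GPU generation and on the contention on the shared accumulator.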

Note also that CUB provides an optimized BlockReduce primitive to compute this reduction. CUB will likely choose the best strategy for the target GPU.
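
For reference, a sketch of the loop rewritten with cub::BlockReduce, using the question's 256-thread blocks (block_sum is an illustrative name; the broadcast through shared memory is needed because BlockReduce returns the result only in thread 0):

#include <cub/cub.cuh>

__global__ void kernel(float * d_all_sums) {
    typedef cub::BlockReduce<float, length_1> BlockReduce;    // length_1 == 256 == blockDim.x
    __shared__ typename BlockReduce::TempStorage temp_storage;
    __shared__ float block_sum;

    int tid = threadIdx.x;
    float final = 0.0f;

    for (int i = 0; i < length_2; i++) {
        int index   = blockIdx.x * blockDim.x * length_2 + threadIdx.x * length_2 + i;
        float power = (2.0f * d[index] - 1.0f) / 2.0f;

        float sum = BlockReduce(temp_storage).Sum(power * power);  // result valid in thread 0 only
        if (tid == 0)
            block_sum = sum;
        __syncthreads();                          // broadcast the block sum to all threads

        final += power / block_sum;
        __syncthreads();                          // keep the next iteration's write to block_sum
                                                  // from racing this read
    }

    if (tid == 0)
        d_all_sums[blockIdx.x] = final;
}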

Jérôme Richard
  • I did try using CUB, but that ended up being almost 5x slower. I'm assuming that because my arrays to sum are relatively small (512 per iteration), the overhead ends up becoming the bottleneck. I did try atomicAdd and an N-way reduction (I'm assuming this means we sum multiple elements per thread?) and it did make my main code a little bit faster (~10%). It brought the compute and memory usage up to similar, higher numbers (~35%) but made the branch divergence much, much worse. Do atomic operations cause the threads to diverge in general? It's not immediately obvious to me why they would. – tripatheea Jul 26 '21 at 05:25
  • That is a bit surprising for CUB, since I thought BlockReduce was specifically designed for such a use. As for the atomic operations, I am not aware of such a weird effect. Assuming you have no condition in your code, threads of the same warp working on the same atomic variable will perform a SIMD reduction without any divergence on fairly recent GPU architectures. Atomic accesses from threads of the same block could be serialized (they are on my GPU), but you can hardly do better than recent hardware implementations. Still, the GPU might count this as divergence. What is your target GPU? – Jérôme Richard Jul 26 '21 at 17:42
  • I used the CUB example at https://nvlabs.github.io/cub/example_block_reduce_8cu-example.html#_a0 I also tried Thrust, but that was even slower. Regarding thread divergence, I found this https://stackoverflow.com/questions/35156157/cuda-atomics-causes-branch-divergence which explains how shared memory atomics can cause divergence. I changed the code to use global memory and that removed all the divergence, but it didn't make much difference in runtime, probably due to the overhead of global memory. The cluster I'd run this on has V100 & P100 GPUs, but I do recognize that they have different microarchitectures. – tripatheea Jul 26 '21 at 19:41