
I am writing a function for a library which accepts a large array (in GPU memory) of a power-of-2 number of elements. This function must sum non-contiguous sub-arrays (of equal length, also a power-of-2), to produce a smaller (or rarely, equally sized) array. This is a GPU version of the OpenMP function described here.

For example,

in[8] = { ... }
out = { in[0]+in[2],  in[1]+in[3],  in[4]+in[6],  in[5]+in[7] }

The elements of the reduced sub-arrays are determined by a user-given list of bit indices, bitInds. They inform how the index (imagined as a bitstring) of an element of in is mapped to an index of out.

For example, if

bitInds = { 0, 2 }

then in index 0 = 000 maps to out[0], and in index 4 = 100 maps to out[2].

The length of bitInds can range from 1 to log2(length(in)). The length of out is pow(2, length(bitInds)). This means out can be just as large as in, or half as large, or a quarter as large, etc.
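
For concreteness, here is a serial reference of the mapping I intend (just a sketch; the function and parameter names are illustrative):

void reduceByBitsReference(
    const double* in, long long inLen,
    double* out, const int* bitInds, int numBits
) {
    // 'out' has length 2^numBits; 'bitInds' holds numBits distinct bit indices
    long long outLen = 1LL << numBits;
    for (long long j=0; j<outLen; j++)
        out[j] = 0;

    for (long long i=0; i<inLen; i++) {

        // gather the bits of 'i' selected by 'bitInds' into an output index
        long long j = 0;
        for (int b=0; b<numBits; b++)
            j |= ((i >> bitInds[b]) & 1LL) << b;

        out[j] += in[i];
    }
}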

Preparing the device memory for this function, and performing well structured reductions within CUDA kernels, seems challenging, and I'm unsure how to start. Since in is guaranteed to be very large, it is important that threads access in locally and sequentially. How can I perform pow(2, length(bitInds)) reductions of possibly non-contiguous sub-arrays of in efficiently?

Anti Earth
  • You might try using atomics. – Robert Crovella Jun 28 '21 at 22:41
  • Do you have a reference on CUDA atomics you would recommend? – Anti Earth Jun 28 '21 at 22:42
  • What you are asking about has some similarities to histogramming, so [this](https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/). You could use the shared memory approach if the length of your out array is in the range of 4096 or less. Otherwise just the regular global atomics. – Robert Crovella Jun 28 '21 at 23:11
  • Except for the lowest bits, global memory is random access. The cache lines are 128 bytes wide. So continuous 128 bytes of especially your input, but also your output should be written by a warp in one memory request (assuming 4 bytes/element). Your kernels can be templated or auto-generated depending on whether the lowest 5 bits (indices 0 to 31 gives lower 128 bytes) are part of bitInds. When reading/writing you can shuffle the elements around or use shared memory. – Sebastian Jun 29 '21 at 05:19
  • For the higher bits: For a large output array you have many independent reductions, then you can do one reduction per thread (no synchronization necessary), for a medium array you would reduce inside a block at the end and for a small output array, you could write an intermediate result with a larger output array with the existing methods first and then reduce a few bits more. When you state your problem so generally, there are a lot of cases with different optimization strategies involved. The cache and locality beyond 128 bytes won't help much, as each element is read only once. – Sebastian Jun 29 '21 at 05:24

1 Answer


I tried three implementations and compared their performance.

  • A: As per @RobertCrovella's example, using atomics on per-block exclusive global memory, and reducing between blocks thereafter.
  • B: Having each thread directly atomically write to global memory.
  • C: Having threads write to block-local shared memory, and finally atomically reducing them to global memory.

Surprisingly, the best solution in both performance and size constraints was the simplest, B. Below, I present pseudocode and describe the trade-offs of each solution, in order of complexity.

B

This method has each thread atomically write to global memory, directly to the final output. Besides requiring that out and in fit in global memory, it imposes no additional constraints on their sizes.

__global__ void myfunc_kernel(double* in, int inLen, double* out, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically add directly to out global memory (vs all threads)
    atomicAdd(&out[outInd], in[inInd]);
}
void myfunc(double* in, int inLen, double* out,  ...) {

    // copy unspecified additional args into device memory (shared, if fits)
    double* d_in = ...

    // create global memory for final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    myfunc_kernel<<<numBlocks, numThreadsPerBlock>>>(d_in, inLen, d_out, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

C

This method assigns each block shared memory, into which the block's threads locally reduce via atomics. Thereafter, each block's shared memory is atomically reduced into the output global memory. Having to fit a block-local copy of out into shared memory constrains outLen to at most (e.g.) 4096 elements (depending on the compute capability).

__global__ void myfunc_kernel(double* in, int inLen, double* out, int outLen, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // each block has a local copy of 'out', in shared memory
    extern __shared__ double outBlock[];
    
    // efficiently initialise shared memory to zero
    // (looped in case fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        outBlock[i] = 0;
    __syncthreads();

    // determine 'outInd' from other args
    int outInd = ...
    
    // atomically contribute thread's 'in' value to block shared memory
    // (vs other threads in block)
    atomicAdd(&outBlock[outInd], in[inInd]);
    __syncthreads();

    // keep one (or fewer) thread(s) per output element alive in each block
    if (threadIdx.x >= outLen)
        return;

    // remaining threads atomically reduce shared memory into global memory
    // (looped in case there are fewer threads than 'outLen')
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        // (vs all surviving threads)
        atomicAdd(&out[i], outBlock[i]);

    /* This final reduction may be possible in parallel/tiers,
     * though documentation & examples about reducing shared memory
     * between blocks are disappointingly scarce
     */
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {
    
    // if 'out' cannot fit in each block's shared memory, error and quit
    if (outLen > 4096)
        ...
       
    // copy unspecified additional args into device memory
    double* d_in = ...

    // create global memory for final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);
    
    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // allocate shared memory to fit 'out' in each block
    size_t mem = outLen * sizeof(double);
    myfunc_kernel<<<numBlocks, numThreadsPerBlock, mem>>>(d_in, inLen, d_out, outLen, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

Note that all reductions were ultimately performed atomically. This should not be necessary, since after the shared memory is finalised, the reduction into global memory has strict structure. Yet, I am unable to find documentation or a single demonstration online of how to collectively reduce shared memory arrays into global memory.
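
One non-atomic possibility (an untested sketch, essentially a hybrid of methods C and A; the name outPartials and the kernel name are illustrative) is to have each block write its finished shared-memory partial, non-atomically, to its own exclusive slice of a global scratch array of length numBlocks * outLen, and to sum the slices in a second kernel:

// at the end of the method-C kernel, replacing the atomicAdd loop:
//   for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
//       outPartials[blockIdx.x*outLen + i] = outBlock[i];

// second kernel: one thread per element of 'out', summing the per-block slices
__global__ void myfunc_combine_kernel(
    double* out, double* outPartials, int outLen, int numPartials
) {
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;

    double total = 0;
    for (int n=0; n<numPartials; n++)
        total += outPartials[n*outLen + outInd];

    out[outInd] = total;
}

This trades the atomics for numBlocks * outLen doubles of extra global memory, which is the same trade-off method A makes below.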

A

This method uses the same paradigm as this histogram example. It creates #numBlocks copies of out in global memory, so that each block writes to exclusive memory. Threads within a block write atomically to their block's global memory partition. Thereafter, a second kernel reduces these partitions into the final out array. This significantly constrains outLen, to be smaller than deviceGlobalMem/numBlocks.

__global__ void myfunc_populate_kernel(double* in, int inLen, double* outs, int outLen, ...) {
    
    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // map to this block's exclusive subarray in global memory
    int blockOutInd = blockIdx.x*outLen + outInd;
    
    // atomically contribute thread's 'in' value to block's global memory
    // (vs threads in same block)
    atomicAdd(&outs[blockOutInd], in[inInd]);
}
__global__ void myfunc_reduce_kernel(double* out, double* outs, int outLen, int numOuts) {

    // each thread handles one element of 'out'
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;
  
    // and determines it by reducing previous per-block partitions
    double total = 0;
    for (int n=0; n<numOuts; n++)
        total += outs[outInd + n*outLen];

    out[outInd] = total;
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {
    
    int numThreadsPerBlock = 128;
    int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);
 
    // if #numPopBlocks copies of 'out' cannot fit in global memory, error and exit
    if (outLen * numPopBlocks > ...)
        ...

    // copy unspecified additional args into device memory (shared, if fits)
    double* d_in = ...

    // create a working 'out' for each block, in global memory (init to zero)
    size_t mem = outLen * numPopBlocks * sizeof(double);
    double* d_outs;
    cudaMalloc(&d_outs, mem);
    cudaMemset(d_outs, 0, mem);
    
    // populate each blocks partition of global memory 'd_outs'
    myfunc_populate_kernel<<<numPopBlocks, numThreadsPerBlock>>>(d_in, inLen, d_outs, outLen, ...);

    // create global memory for final output
    double* d_out = ...

    // reduce each blocks partition of global memory into output 'd_out'
    int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
    myfunc_reduce_kernel<<<numReduceBlocks, numThreadsPerBlock>>>(d_out, d_outs, outLen, numPopBlocks);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}

Comparison

I tested these methods on a Quadro P6000 (24 GB, CC 6.1), with inLen = 2^5, 2^10, 2^15, 2^20, 2^25 and various outLen between 2^5 and 2^20 (always with outLen <= inLen). Here is a table of the runtimes:

╔══════════╦═════════╦══════════╦══════════╦══════════╗
║  inLen   ║ outLen  ║  A       ║  B       ║  C       ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║       32 ║      32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║     1024 ║      32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║     1024 ║    1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║    32768 ║      32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║    32768 ║    1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║    32768 ║   32768 ║ 0.000762 ║ 0.000111 ║ N/A      ║
║  1048576 ║      32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║  1048576 ║    1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║  1048576 ║   32768 ║ 0.017547 ║ 0.000899 ║ N/A      ║
║  1048576 ║ 1048576 ║ N/A      ║ 0.006092 ║ N/A      ║
║ 33554432 ║      32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║    1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║   32768 ║ 0.013734 ║ 0.011915 ║ N/A      ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝

Despite having the fewest limitations, method B is the clear winner, as illustrated more clearly by these runtime ratios:

╔══════════════╦══════╦═════╗
║ inLen/outLen ║ A/B  ║ C/B ║
╠══════════════╬══════╬═════╣
║            1 ║ 0.9  ║ 0.5 ║
║           32 ║ 5.5  ║ 1.6 ║
║            1 ║ 6.2  ║ 1.5 ║
║         1024 ║ 3.8  ║ 1.0 ║
║           32 ║ 4.9  ║ 1.6 ║
║            1 ║ 6.9  ║     ║
║        32768 ║ 1.2  ║ 0.3 ║
║         1024 ║ 4.0  ║ 6.7 ║
║           32 ║ 19.5 ║     ║
║            1 ║      ║     ║
║      1048576 ║ 0.9  ║ 0.4 ║
║        32768 ║ 4.0  ║ 1.7 ║
║         1024 ║ 1.2  ║     ║
╚══════════════╩══════╩═════╝

Pitfalls

It is difficult to be polite about the state of the CUDA documentation and the examples online, so I won't comment, and instead offer these warnings of pitfalls to the puzzled reader:

  • Beware that shared memory is not guaranteed to be initialised to zero, even though it may appear to be on a kernel's first execution. You must initialise it manually within the kernel (there is no convenient CUDA function to do this for you).
  • atomicAdd has no built-in overload for double on older architectures. The doc claims this is true only for compute capabilities < 6.0, but my code would not compile for my 6.2 CC GPU without a manual overload. Furthermore, the doc's code snippet of #if __CUDA_ARCH__ < 600 is problematic, since __CUDA_ARCH__ is not defined at all during the host compilation pass. View a correct macro here, and see the sketch after this list.
  • Beware that the popular parallel reduction paradigms you find online assume all data is in global memory, or in a single block's shared memory. Reducing shared memory into a global structure seems esoteric.
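
Regarding the second pitfall, the guard I have in mind looks roughly like the following (a sketch adapted from the atomicCAS-based fallback in the CUDA programming guide; verify it against your toolkit version):

// fallback double-precision atomicAdd, compiled only in device passes
// targeting compute capability < 6.0; CC >= 6.0 provides a built-in overload
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* addressAsUll = (unsigned long long int*) address;
    unsigned long long int old = *addressAsUll, assumed;
    do {
        assumed = old;
        old = atomicCAS(addressAsUll, assumed,
            __double_as_longlong(val + __longlong_as_double(assumed)));
        // integer comparison avoids looping forever on NaN (since NaN != NaN)
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif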

Determining outInd

For the interested reader to better understand the out access pattern, here is how outInd is determined by each thread (of which, across all blocks combined, there are inLen). Note the actual index types are long long int, since inLen and outLen can each be as large as 2^30 = 1073741824.

__global__ void myfunc_kernel(
    double* in, int inLen, double* out, int* bitInds, int numBits
) {

    long long int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine the number encoded by bits {bitInds} of inInd
    long long int outInd = 0;
    for (int b=0; b<numBits; b++) {
        int bit = (inInd >> bitInds[b]) & 1;
        outInd += bit * (1LL << b);
    }

    ...
}

For some possible inputs of bitInds, the access pattern is simple. For example:

  • If bitInds = {0, 1, ..., n-1} (the lowest n bits, contiguous from 0), then outInd = inInd % 2^n, and the kernel reduces sub-arrays whose elements are strided through in with stride 2^n.
  • If bitInds = {n, n+1, n+2, ..., log2(inLen)-1} (the highest bits), then outInd = inInd >> n, i.e. outInd = floor(inInd / 2^n), and the kernel is (sub-optimally) performing reductions of contiguous partitions of in.

However, if bitInds is an arbitrarily ordered subset of the indices {0, ..., log2(inLen)-1}, then outInd must be determined bit by bit, as performed above.
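
As a quick check against the question's example (bitInds = {0, 2}, inLen = 8), consider inInd = 6:

// bitInds = {0, 2},  inInd = 6 = 110 (binary)
// b = 0:  bit = (6 >> bitInds[0]) & 1 = (6 >> 0) & 1 = 0,  adds 0 * 2^0 = 0
// b = 1:  bit = (6 >> bitInds[1]) & 1 = (6 >> 2) & 1 = 1,  adds 1 * 2^1 = 2
// hence outInd = 2, i.e. in[6] is accumulated into out[2],
// matching out[2] = in[4] + in[6] in the question's example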

Anti Earth
  • It seems wasteful to have all additions require atomic operations on global memory. Can't you have threads add up some of the elements locally, before updating global memory? – einpoklum Jul 01 '21 at 22:44
  • The function this library will go into doesn't perform any auto-tuning, and tries to make little assumptions of the user's GPU. Furthermore, the size of the inputs (as you can see) varies exponentially, and all extremities are realistic use-cases. It is therefore hard to make optimisations like that – Anti Earth Jul 01 '21 at 22:51
  • My comment has nothing to do with auto-tuning or assumptions about the user's GPU. As for the input size - it doesn't matter. That is, if the input is so small as to be covered by one element for, say, each schedulable warp a GPU can hold at a time - then optimizations don't matter either way, since it won't take a long time anyway. If the input is larger than that (say, hundreds of thousands of elements, or more), then my suggestion is pertinent. – einpoklum Jul 02 '21 at 07:50
  • shared memory is not guaranteed to be initialized even on the first execution. You cannot rely on initialization of shared memory, you must do it yourself, always, for correctness. I had no trouble compiling `double` `atomicAdd` for cc6.2. You have to specify the target you are compiling for. The compiler (i.e. `nvcc`) pays no attention to any GPUs you may or may not have. One problem I had in trying to help you is that I still don't understand how the `outInd` are selected. In all your examples you seem to leave that out, but it's obviously important for actual performance analysis. – Robert Crovella Jul 02 '21 at 14:43
  • @RobertCrovella good point, though I'd hoped you'd believe me that the access pattern cannot be optimised in general :) I've added a new section to my answer explaining how `outInd` is calculated. Regarding `atomicAdd` - yep I understand how to supply the compute capability – Anti Earth Jul 03 '21 at 16:58
  • Having answered your own question, could you please accept the answer so that this question falls off the unanswered queue for the CUDA tag? – talonmies Dec 12 '22 at 04:09