I tried three implementations and compared their performance.
- A: As per @RobertCrovella's example, using atomics on per-block exclusive global memory, and reducing between blocks thereafter.
- B: Having each thread directly atomically write to global memory.
- C: Having threads write to block-local shared memory, and finally atomically reducing them to global memory.
The best solution, in both performance and size constraints, was surprisingly the simplest: B. I present pseudocode for these solutions, and describe their trade-offs, in order of their complexity.
B
This method has each thread atomically write to global memory, directly to the final output. Besides fitting out and in in global memory, it imposes no additional constraints on their sizes.
__global__ void myfunc_kernel(double* in, int inLen, double* out, ...) {

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // atomically add directly to 'out' in global memory (vs all other threads)
    atomicAdd(&out[outInd], in[inInd]);
}
void myfunc(double* in, int inLen, double* out, ...) {

    // copy 'in' and unspecified additional args into device memory (shared, if it fits)
    double* d_in = ...

    // create global memory for final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);

    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);
    myfunc_kernel<<<numBlocks, numThreadsPerBlock>>>(d_in, inLen, d_out, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
C
This method assigns each block shared memory, into which threads in a block locally reduce, via atomics. Thereafter, the shared memory is reduced between blocks into the output global memory, via atomics. Having to fit a local out into shared memory constrains outLen to at most (e.g.) 4096 (depending on the compute capability).
__global__ void myfunc_kernel(double* in, int inLen, double* out, int outLen, ...) {

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;

    // each block has a local copy of 'out', in shared memory
    extern __shared__ double outBlock[];

    // efficiently initialise shared memory to zero
    // (looped in case there are fewer threads than outLen)
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        outBlock[i] = 0;
    __syncthreads();

    // guard, rather than return early, so that every thread in the block
    // reaches the __syncthreads() calls
    if (inInd < inLen) {

        // determine 'outInd' from other args
        int outInd = ...

        // atomically contribute thread's 'in' value to block shared memory
        // (vs other threads in block)
        atomicAdd(&outBlock[outInd], in[inInd]);
    }
    __syncthreads();

    // keep one (or fewer) thread(s) per output element alive in each block
    if (threadIdx.x >= outLen)
        return;

    // remaining threads atomically reduce shared memory into global memory
    // (looped in case there are fewer threads than 'outLen')
    for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
        // (vs the corresponding threads of all other blocks)
        atomicAdd(&out[i], outBlock[i]);

    /* This final reduction may be possible in parallel/tiers,
     * though documentation & examples about reducing shared memory
     * between blocks are disappointingly scarce
     */
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {

    // if 'out' cannot fit in each block's shared memory, error and quit
    if (outLen > 4096)
        ...

    // copy 'in' and unspecified additional args into device memory
    double* d_in = ...

    // create global memory for final output (initialised to zero)
    double* d_out = ...
    cudaMemset(d_out, 0, ...);

    // allocate one thread per element of 'in'
    int numThreadsPerBlock = 128;
    int numBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // allocate shared memory large enough to fit 'out' in each block
    size_t mem = outLen * sizeof(double);
    myfunc_kernel<<<numBlocks, numThreadsPerBlock, mem>>>(d_in, inLen, d_out, outLen, ...);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
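As another aside (not part of the pseudocode above), the hard-coded 4096 bound can instead be derived at runtime from the device's per-block shared memory capacity:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // largest 'out' that fits in one block's shared memory
    size_t maxOutLen = prop.sharedMemPerBlock / sizeof(double);

    if ((size_t) outLen > maxOutLen) {
        // 'out' cannot fit in shared memory; fall back to method B
    }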
Note that all reductions were ultimately performed atomically. This should not be necessary, since once the shared memory is finalised, the reduction into global memory has a strict structure. Yet I am unable to find documentation, or a single demonstration online, of how to collectively reduce shared-memory arrays into global memory.
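For completeness, here is an untested sketch of one such 'tiered' possibility, using cooperative groups: each block dumps its shared copy into an exclusive global slice without atomics, the whole grid synchronises, and the slices are then summed with plain adds. It requires compute capability 6.0+, compilation with -rdc=true, a launch via cudaLaunchCooperativeKernel with a grid small enough to be co-resident, and a numBlocks * outLen scratch buffer (so it reintroduces the memory constraint of method A below). The outInd mapping shown is a hypothetical placeholder.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void myfunc_coop_kernel(
        double* in, int inLen, double* out, int outLen,
        double* scratch   // numBlocks * outLen doubles
    ) {
        cg::grid_group grid = cg::this_grid();

        // per-block copy of 'out' in (dynamic) shared memory, zero-initialised
        extern __shared__ double outBlock[];
        for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
            outBlock[i] = 0;
        __syncthreads();

        // populate the block-local copy (atomics only contend within the block)
        int inInd = blockIdx.x*blockDim.x + threadIdx.x;
        if (inInd < inLen) {
            int outInd = inInd % outLen;   // hypothetical placeholder mapping
            atomicAdd(&outBlock[outInd], in[inInd]);
        }
        __syncthreads();

        // each block writes its copy to an exclusive global slice: no atomics
        for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
            scratch[(long long) blockIdx.x * outLen + i] = outBlock[i];

        // wait for every block, then sum the slices with plain adds
        grid.sync();
        for (long long int i = grid.thread_rank(); i < outLen; i += grid.size()) {
            double total = 0;
            for (int b=0; b<gridDim.x; b++)
                total += scratch[(long long) b * outLen + i];
            out[i] = total;
        }
    }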
A
This method uses the same paradigm as this histogram example. It creates #numBlocks copies of out in global memory, so that each block writes to exclusive memory. Threads within a block write atomically to their block's global memory partition. Thereafter, a second kernel reduces these partitions into the final out array. This significantly constrains outLen, which must be smaller than deviceGlobalMem/numBlocks.
__global__ void myfunc_populate_kernel(double* in, int inLen, double* outs, int outLen, ...) {

    // each thread handles one element of 'in'
    int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine 'outInd' from other args
    int outInd = ...

    // map to this block's exclusive subarray in global memory
    int blockOutInd = blockIdx.x*outLen + outInd;

    // atomically contribute thread's 'in' value to the block's global memory
    // (vs other threads in the same block)
    atomicAdd(&outs[blockOutInd], in[inInd]);
}
__global__ void myfunc_reduce_kernel(double* out, double* outs, int outLen, int numOuts) {

    // each thread handles one element of 'out'
    int outInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (outInd >= outLen)
        return;

    // and determines it by reducing the previous per-block partitions
    double total = 0;
    for (int n=0; n<numOuts; n++)
        total += outs[outInd + n*outLen];
    out[outInd] = total;
}
void myfunc(double* in, int inLen, double* out, int outLen, ...) {

    int numThreadsPerBlock = 128;
    int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);

    // if #numPopBlocks copies of 'out' cannot fit in global memory, error and exit
    if (outLen * numPopBlocks > ...)
        ...

    // copy 'in' and unspecified additional args into device memory (shared, if it fits)
    double* d_in = ...

    // create a working copy of 'out' for each block, in global memory (initialised to zero)
    size_t mem = outLen * numPopBlocks * sizeof(double);
    double* d_outs;
    cudaMalloc(&d_outs, mem);
    cudaMemset(d_outs, 0, mem);

    // populate each block's partition of global memory 'd_outs'
    myfunc_populate_kernel<<<numPopBlocks, numThreadsPerBlock>>>(d_in, inLen, d_outs, outLen, ...);

    // create global memory for the final output
    double* d_out = ...

    // reduce each block's partition of global memory into the output 'd_out'
    int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
    myfunc_reduce_kernel<<<numReduceBlocks, numThreadsPerBlock>>>(d_out, d_outs, outLen, numPopBlocks);

    // copy output to host, and clean up
    cudaMemcpy(out, d_out, ...);
    cudaFree(...);
}
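Similarly (not part of the pseudocode above), the elided bound in the numPopBlocks check can be made concrete by querying how much free global memory the device currently has:

    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);

    size_t neededMem = (size_t) outLen * numPopBlocks * sizeof(double);
    if (neededMem > freeMem) {
        // the per-block copies of 'out' will not fit in global memory;
        // fall back to method B
    }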
Comparison
I tested these methods with a Quadro P6000 (24GB, 6.1 CC), with inLen = 2^5, 2^10, 2^15, 2^20, 2^25 and (for all of those values) outLen = 2^5, 2^10, 2^15 (keeping outLen <= inLen). Here is a table of the runtimes:
╔══════════╦═════════╦══════════╦══════════╦══════════╗
║ inLen ║ outLen ║ A ║ B ║ C ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║ 32 ║ 32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║ 1024 ║ 32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║ 1024 ║ 1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║ 32768 ║ 32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║ 32768 ║ 1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║ 32768 ║ 32768 ║ 0.000762 ║ 0.000111 ║ N/A ║
║ 1048576 ║ 32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║ 1048576 ║ 1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║ 1048576 ║ 32768 ║ 0.017547 ║ 0.000899 ║ N/A ║
║ 1048576 ║ 1048576 ║ N/A ║ 0.006092 ║ N/A ║
║ 33554432 ║ 32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║ 1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║ 32768 ║ 0.013734 ║ 0.011915 ║ N/A ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝
Despite having the fewest limitations, method B is the clear winner, as more clearly illustrated by the runtime ratios below (one row per row of the table above):
╔══════════════╦══════╦═════╗
║ inLen/outLen ║ A/B  ║ C/B ║
╠══════════════╬══════╬═════╣
║ 1 ║ 0.9 ║ 0.5 ║
║ 32 ║ 5.5 ║ 1.6 ║
║ 1 ║ 6.2 ║ 1.5 ║
║ 1024 ║ 3.8 ║ 1.0 ║
║ 32 ║ 4.9 ║ 1.6 ║
║ 1 ║ 6.9 ║ ║
║ 32768 ║ 1.2 ║ 0.3 ║
║ 1024 ║ 4.0 ║ 6.7 ║
║ 32 ║ 19.5 ║ ║
║ 1 ║ ║ ║
║ 1048576 ║ 0.9 ║ 0.4 ║
║ 32768 ║ 4.0 ║ 1.7 ║
║ 1024 ║ 1.2 ║ ║
╚══════════════╩══════╩═════╝
Pitfalls
It is difficult to be polite about the state of CUDA documentation and the examples online. So I won't comment, but offer up these warnings of pitfalls to the puzzled reader:
- Beware that while shared memory seems to be initialised to zero at the first execution of the kernel, it is not re-initialised in subsequent executions. Hence you must re-initialise manually within a kernel (there is no convenient CUDA function).
- atomicAdd is missing an overload for double types. The doc claims this is true only for compute capabilities < 6.0, but my code would not compile on my 6.2 CC GPU without a manual overload. Furthermore, the doc's code-snippet guard of #if __CUDA_ARCH__ < 600 is wrong, since newer arch precompilers don't define __CUDA_ARCH__. View a correct macro here; a sketch of the guarded fallback appears after this list.
- Beware that the popular parallel reduction paradigms you find online will assume all data is in global memory, or in a single block's shared memory. Reducing shared memory into a global structure seems esoteric.
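For reference, here is a sketch of such a fallback. The CAS loop is the one given in the CUDA Programming Guide; the guard is my paraphrase of the idea behind the linked macro (only define the overload on device-compilation passes for pre-6.0 architectures), not a copy of it:

    #if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
    // built-in double-precision atomicAdd exists (or this is a host compilation pass)
    #else
    __device__ double atomicAdd(double* address, double val) {
        unsigned long long int* address_as_ull = (unsigned long long int*) address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        // integer comparison avoids looping forever on NaN (since NaN != NaN)
        } while (assumed != old);
        return __longlong_as_double(old);
    }
    #endif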
Determining outInd
For the interested reader to better understand the out access pattern, here is how outInd is determined by each thread, of which (between all blocks combined) there are inLen. Note that the actual index types are long long int, since inLen and outLen can both be as big as 2^30 = 1073741824.
__global__ void myfunc_kernel(
    double* in, int inLen, double* out, int* bitInds, int numBits
) {
    long long int inInd = blockIdx.x*blockDim.x + threadIdx.x;
    if (inInd >= inLen)
        return;

    // determine the number encoded by bits {bitInds} of inInd
    long long int outInd = 0;
    for (int b=0; b<numBits; b++) {
        int bit = (inInd >> bitInds[b]) & 1;
        outInd += bit * (1LL << b);
    }
    ...
}
For some possible inputs of bitInds, the access pattern is simple. For example:
- If bitInds = {0, 1, ..., n-1} (contiguous, increasing from 0), then outInd = inInd % 2^n, and our kernel is (sub-optimally) performing reductions of contiguous partitions of in.
- If bitInds = {n, n+1, n+2, ..., log2(inLen)-1} (contiguous, increasing from n), then outInd = floor(inInd / 2^n), and the kernel reduces each contiguous chunk of 2^n elements of in into a single element of out.
However, if bitInds is an arbitrarily ordered subset of the indices {0, ..., log2(inLen)-1}, then outInd must be determined from each bit, as performed above.
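To make the general case concrete, here is a tiny host-side rendition of the same bit-gathering loop (an addition, for illustration only), applied to one worked example:

    // the same logic as the kernel's loop, as a plain host function
    long long int gatherBits(long long int inInd, const int* bitInds, int numBits) {
        long long int outInd = 0;
        for (int b=0; b<numBits; b++)
            outInd += ((inInd >> bitInds[b]) & 1) * (1LL << b);
        return outInd;
    }

    // e.g. bitInds = {0, 2} maps inInd = 0b0110 (= 6) to outInd = 0b10 (= 2):
    // bit 0 of inInd is 0 (becomes bit 0 of outInd), bit 2 of inInd is 1 (becomes bit 1)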