I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, say, two buckets.
The easiest way to do it appears to be:
- Load the variables into shared memory
- Do vectorized loads for unsigned char, etc. if needed.
- Do an atomic add in shared memory
- Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;

    // per-block histogram in shared memory (NUM_THREADS_PER_BLOCK >= NUM_BINS)
    __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
    block_reduced[threadIdx.x] = 0;
    __syncthreads();

    // bin the element owned by this thread
    atomicAdd(&block_reduced[data[tid]], 1);
    __syncthreads();

    // merge the block's histogram into the global one
    for(int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&count[i], block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are given as reasons. I don't know of any way to remove either of these for this calculation to any significant degree.
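For what it's worth, one way to reduce inter-warp contention would be to give each warp its own private copy of the bins in shared memory and merge the copies at the end. This is an untested sketch, reusing the placeholder constants from above, and it does nothing for contention within a warp:

// Untested sketch: one private copy of the histogram per warp, so that
// shared-memory atomics from different warps never collide on the same address.
// NUM_BINS and NUM_WARPS_PER_BLOCK are the same compile-time constants as above.
__global__ void warp_private_shmem_atomics(int *data, int *count){
    __shared__ int sub_hist[NUM_WARPS_PER_BLOCK][NUM_BINS];
    uint tid     = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;

    // zero all the sub-histograms cooperatively
    for(int i = threadIdx.x; i < NUM_WARPS_PER_BLOCK*NUM_BINS; i += blockDim.x)
        (&sub_hist[0][0])[i] = 0;
    __syncthreads();

    // each warp only touches its own copy
    atomicAdd(&sub_hist[warp_id][data[tid]], 1);
    __syncthreads();

    // merge the per-warp copies and push the block total to global memory
    for(int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x){
        int sum = 0;
        for(int w = 0; w < NUM_WARPS_PER_BLOCK; w++)
            sum += sub_hist[w][bin];
        atomicAdd(&count[bin], sum);
    }
}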
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp-level reductions: using __ballot() to grab the warp results and then __popc() to do the warp-level count.
__global__ void ballot_popc_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;
    //need lane_ids since we are going warp level
    uint lane_id = threadIdx.x % 32;
    //for ballot
    uint warp_set_bits = 0;

    //to store warp level sums
    __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
    //shared data
    __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

    //load shared data - could store to registers
    s_data[threadIdx.x] = data[tid];
    __syncthreads();

    //suspicious loop - I think we need more parallelism
    for(int i = 0; i < NUM_BINS; i++){
        //every lane votes on whether its element falls into bin i
        warp_set_bits = __ballot(s_data[threadIdx.x] == i);
        if(lane_id == 0){
            //lane 0 records the number of set bits = matches in this warp
            warp_reduced_count[warp_id] = __popc(warp_set_bits);
        }
        __syncthreads();

        //reduce the per-warp counts in shared memory
        //(all threads take part so __syncthreads() is not reached divergently;
        // could use shfl, but it does not change the overall picture)
        for(int j = NUM_WARPS_PER_BLOCK/2; j > 0; j >>= 1){
            if(threadIdx.x < j)
                warp_reduced_count[threadIdx.x] += warp_reduced_count[threadIdx.x + j];
            __syncthreads();
        }

        if(threadIdx.x == 0){
            atomicAdd(&count[i], warp_reduced_count[0]);
        }
    }
}
This gives decent numbers (well, that is moot - peak device memory bandwidth is 133 GB/s, and things seem to depend on the launch configuration) for the single-bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. With 32 bins, performance drops to about 5 GB/s. The reason is probably the serial loop over all the bins, which suggests that the NUM_BINS loop needs to be parallelized.
I have tried several ways of parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create a set of blocks for each bin. This seems to behave the same way, possibly because we again suffer from contention, with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing over the bins in the y direction gives similarly uninspiring results; a sketch of what I mean is below.
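Roughly, the y-direction variant looks like this (a minimal sketch, not the exact kernel I benchmarked): blockIdx.y selects the bin, so every bin gets its own set of blocks, but each of those blocks re-reads the same data from global memory.

// Sketch of "parallelize over bins in y": blockIdx.y picks the bin,
// __ballot/__popc counts the matches within each warp, and the warp leader
// accumulates into global memory. Every (x,y) block re-reads the same slice
// of data, which is presumably where the time goes.
__global__ void ballot_popc_bins_in_y(int *data, int *count){
    uint tid     = blockIdx.x*blockDim.x + threadIdx.x;
    uint lane_id = threadIdx.x % 32;
    int  bin     = blockIdx.y;   // one bin per y-slice of the grid

    uint warp_set_bits = __ballot(data[tid] == bin);
    if(lane_id == 0)
        atomicAdd(&count[bin], __popc(warp_set_bits));
}

// launched as, e.g.:
// ballot_popc_bins_in_y<<<dim3(num_blocks_x, NUM_BINS), NUM_THREADS_PER_BLOCK>>>(d_data, d_count);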
The other idea I tried, just for kicks, was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to there being no real compute work for the child kernels, plus the launch overhead.
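Roughly what I mean is the following (a sketch, not the exact code I timed; it needs -rdc=true and compute capability 3.5 or higher, and assumes the data size is a multiple of the block size):

// Sketch of the dynamic-parallelism experiment: the parent launches one child
// grid per bin; each child just counts matches for its bin with global atomics.
__global__ void count_one_bin(const int *data, int *count, int bin){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    if(data[tid] == bin)
        atomicAdd(&count[bin], 1);
}

__global__ void launch_per_bin(int *data, int *count, int n){
    // a single thread issues all the child launches
    if(blockIdx.x == 0 && threadIdx.x == 0)
        for(int bin = 0; bin < NUM_BINS; bin++)
            count_one_bin<<<n/NUM_THREADS_PER_BLOCK, NUM_THREADS_PER_BLOCK>>>(data, count, bin);
}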
The most promising approach seems to be the one from Nicholas Wilt's article on so-called privatized histograms, which keep a set of bins for each thread in shared memory. This would ostensibly be very heavy on shared memory usage (and we only have 48 kB per SM on Maxwell).
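To put a number on it: with, say, 256 threads per block and 32 bins of 4-byte counters, fully per-thread privatization would need 256 × 32 × 4 B = 32 kB of shared memory per block, which leaves almost nothing for anything else and would hurt occupancy badly; per-warp privatization (as in the sketch further up) needs only 8 × 32 × 4 B = 1 kB.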
Perhaps someone could shed some light on the problem? I feel that one ought to change the algorithm instead, so as not to use histograms at all and to use something less frequentist. Otherwise, I suppose we just use the atomics version.
Edit: The context for my problem is computing probability density functions to be used for pattern classification. We can compute approximate histograms (more precisely, pdfs) using non-parametric methods such as Parzen windows or kernel density estimation. However, this does not overcome the problem of dimensionality, as we need to sum over all data points for every bin, which becomes expensive when the number of bins is large. See here: Parzen
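For concreteness, the brute-force evaluation I have in mind looks something like this (a sketch only; the Gaussian kernel, the names, and the one-thread-per-bin layout are just for illustration). The cost scales as the number of bins times the number of samples:

// Sketch of a brute-force Parzen-window / KDE evaluation: one thread per bin
// centre, each summing a Gaussian kernel over all n data points.
// Cost is O(num_bins * n), which is what gets expensive as num_bins grows.
__global__ void parzen_pdf(const float *samples, int n,
                           const float *bin_centers, float *pdf,
                           int num_bins, float h /* bandwidth */){
    int bin = blockIdx.x*blockDim.x + threadIdx.x;
    if(bin >= num_bins) return;

    float x   = bin_centers[bin];
    float sum = 0.0f;
    for(int i = 0; i < n; i++){                 // sum over every data point
        float u = (x - samples[i]) / h;
        sum += expf(-0.5f * u * u);             // unnormalized Gaussian kernel
    }
    // normalization: 1 / (n * h * sqrt(2*pi))
    pdf[bin] = sum / (n * h * 2.506628274631f);
}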