
I am trying to optimize my histogram calculation in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.

The easiest way to do it appears to be:

  1. Load the variables into shared memory
    • Do vectorized loads for unsigned char, etc. if needed.
  2. Do an atomic add in shared memory
  3. Do a coalesced write to global.

Something like this:

__global__ void shmem_atomics_reducer(int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;

  //per-block bins, zeroed by each thread
  __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
  block_reduced[threadIdx.x] = 0;

  __syncthreads();

  //data[tid] holds the bin index for this pixel
  atomicAdd(&block_reduced[data[tid]], 1);
  __syncthreads();

  //one coalesced pass of global atomics per block
  //(assumes NUM_BINS <= NUM_THREADS_PER_BLOCK, as the shared array size implies)
  for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    atomicAdd(&count[i], block_reduced[i]);
}

The performance of this kernel drops (naturally) when we decrease the number of bins: from around 45 GB/s at 32 bins down to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are the usual explanations, and I don't know of any way to remove either of them for this calculation in any significant way.
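One mitigation I have sketched (but not benchmarked) is to replicate the shared-memory sub-histogram once per warp, so that at least the warps stop contending with each other for the same counters; it does nothing for the serialization inside a warp when everything lands in one bin. The kernel name is made up, and NUM_BINS, NUM_THREADS_PER_BLOCK and NUM_WARPS_PER_BLOCK are the same compile-time constants used in the kernels in this post:

__global__ void shmem_replicated_reducer(int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint warp_id = threadIdx.x >> 5;

  //one private copy of the histogram per warp
  __shared__ int sub_hist[NUM_WARPS_PER_BLOCK][NUM_BINS];

  //zero all copies cooperatively
  int *flat = &sub_hist[0][0];
  for(int i=threadIdx.x; i<NUM_WARPS_PER_BLOCK*NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    flat[i] = 0;
  __syncthreads();

  //each warp only touches its own copy
  atomicAdd(&sub_hist[warp_id][data[tid]], 1);
  __syncthreads();

  //fold the copies and issue one set of global atomics per block
  for(int bin=threadIdx.x; bin<NUM_BINS; bin+=NUM_THREADS_PER_BLOCK){
    int sum = 0;
    for(int w=0; w<NUM_WARPS_PER_BLOCK; w++)
      sum += sub_hist[w][bin];
    atomicAdd(&count[bin], sum);
  }
}

Shared memory grows to NUM_WARPS_PER_BLOCK * NUM_BINS * 4 bytes, which is still small for a handful of bins.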

I've also been experimenting with another (beautiful) idea from the Parallel Forall blog: warp-level reductions that use __ballot to grab the warp results and then __popc() to do the warp-level reduction.

__global__ void ballot_popc_reducer(int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint warp_id = threadIdx.x >> 5;

  //need lane_ids since we are going warp level
  uint lane_id = threadIdx.x%32;

  //for ballot
  uint warp_set_bits = 0;

  //to store warp level sums
  __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
  //shared data
  __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

  //load shared data - could store to registers
  s_data[threadIdx.x] = data[tid];

  __syncthreads();

  //suspicious loop - I think we need more parallelism
  for(int i=0; i<NUM_BINS; i++){
    warp_set_bits = __ballot(s_data[threadIdx.x]==i);

    if(lane_id==0){
      warp_reduced_count[warp_id] = __popc(warp_set_bits);
    }

    __syncthreads();

    //block-level reduce of the per-warp counts
    //(could use shfl, but it does not change the overall picture;
    // the predicate sits inside the loop so every thread reaches __syncthreads())
    for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
      if(warp_id==0 && threadIdx.x<j)
        warp_reduced_count[threadIdx.x] += warp_reduced_count[threadIdx.x+j];
      __syncthreads();
    }

    if(threadIdx.x==0){
      atomicAdd(&count[i], warp_reduced_count[0]);
    }

    __syncthreads();
  }
}

This gives decent numbers for the single-bin case (35-40 GB/s, as against 10-15 GB/s using atomics), though "decent" is relative: peak device memory bandwidth is 133 GB/s, and the numbers seem to depend on the launch configuration. Performance drops drastically when we increase the number of bins; when we run with 32 bins, it falls to about 5 GB/s. The reason is perhaps the loop running serially over all the bins, which is asking to be parallelized over NUM_BINS.

I have tried several ways of parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we again suffer from contention, with multiple blocks attempting to read from global memory; plus, the programming is clunky. Likewise, parallelizing over the bins in the y direction gives similarly uninspiring results.
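For concreteness, one more variant I have only sketched (not tuned) keeps the __ballot/__popc trick but lets lane 0 of every warp accumulate into a per-bin shared counter, instead of running the block-wide tree reduction inside the bin loop. Again the kernel name is made up and the constants are the ones from above:

__global__ void ballot_popc_per_bin(int *data, int *count){
  uint tid = blockIdx.x*blockDim.x + threadIdx.x;
  uint lane_id = threadIdx.x%32;

  __shared__ uint s_bins[NUM_BINS];

  //zero the per-block bin counters
  for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    s_bins[i] = 0;
  __syncthreads();

  int my_val = data[tid];

  for(int i=0; i<NUM_BINS; i++){
    uint warp_set_bits = __ballot(my_val==i);
    //one shared atomic per warp per non-empty bin
    if(lane_id==0 && warp_set_bits)
      atomicAdd(&s_bins[i], __popc(warp_set_bits));
  }
  __syncthreads();

  //one pass of global atomics per block
  for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_THREADS_PER_BLOCK)
    atomicAdd(&count[i], s_bins[i]);
}

It still walks all the bins per warp, so I would not expect miracles as NUM_BINS grows; it mainly removes the __syncthreads-heavy reduction from inside the loop.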

The other idea I tried, just for kicks, was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, probably owing to the child kernels having no real compute work, plus the launch overhead.

The most promising approach seems to be, from Nicholas Wilt's article, to use these so-called privatized histograms, containing bins for each thread in shared memory, which would ostensibly be very heavy on shared memory usage (and we only have 48 kB per SM on Maxwell).
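A quick back-of-the-envelope check of that shared memory cost, assuming 8-bit per-thread counters that are flushed before they can overflow (my assumption about the layout, not necessarily Wilt's exact scheme; the names are hypothetical):

//hypothetical numbers, just to size a per-thread privatized layout (needs -std=c++11)
constexpr int PRIV_BINS    = 64;   //bins
constexpr int PRIV_THREADS = 256;  //threads per block
constexpr int PRIV_SHMEM   = PRIV_BINS * PRIV_THREADS * sizeof(unsigned char); //16 kB
static_assert(PRIV_SHMEM <= 48 * 1024, "per-thread privatized histogram does not fit in 48 kB");

With 32-bit counters the same configuration would already need 64 kB, more than the 48 kB available per SM on Maxwell.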

Perhaps someone could shed some insight into the problem? I feel that one ought to change the algorithm instead, so as not to use histograms at all and to use something less frequentist. Otherwise, I suppose we just use the atomics version.

Edit: The context for my problem is computing probability density functions to be used for pattern classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen windows or kernel density estimation. However, this does not overcome the problem of dimensionality, as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
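For reference, the estimator I have in mind is the standard kernel density estimate

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),

so every evaluation point (every bin centre) has to touch all n data points, which is where the cost grows with the number of bins.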

kakrafoon
  • As your question stands, it should probably be closed due to "Not clear what you're asking" - you're being a bit vague in some places, especially about the exact constraints of your problem, what you expect to happen in your "let's say" example, and so on. Also, you're basically asking for an opinion rather than a concrete answer to a question, which is another reason to close. However, I am personally working on almost the same thing right now, so I'm biased. At any rate, I would actually like to offer my opinion - off-site. – einpoklum Oct 16 '16 at 15:43
  • I've created a [room](http://chat.stackoverflow.com/rooms/125842) if you'd like to talk further. – einpoklum Oct 16 '16 at 16:08
  • I was actually looking for something along the lines of a style guide on histogram and atomics calculations when there is a lot of degeneracy in the inputs. Happy to discuss. – kakrafoon Oct 22 '16 at 04:53
  • After nearly three years, and after studying a bit more machine learning, I have decided that the problem is not so much with implementing the algorithm (we use mutual information, where the joint histogram necessitates using atomics), but with the algorithm itself. So I think we can consider this problem 'solved' for our purposes. There is no cure for the atomics malaise. – kakrafoon Jun 18 '19 at 21:57

1 Answer


I faced similar challenges when working with clustering, but in the end the best solution was to use the scan pattern to group the processing, so I don't think that would work for you. Since you asked for some experience in this area, I'll share mine with you.

The issues

In your first kernel, I suspect that the low performance as the number of bins is reduced is linked to warp stalls, since you perform very little processing for every piece of data you evaluate. When the number of bins is increased, the ratio between processing and global memory loads (data info) for that kernel also increases. You can check that very easily with the "Issue Efficiency" experiments under Performance Analysis in Nsight. Probably you are getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).

In my case I saw plenty of cycles without an eligible warp. Since I was not able to improve the number of eligible warps to somewhere close to 95%, I gave up on this approach, since for some cases it gets worse (memory dependency stalls ate up 90% of my processing cycles).

The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for every bin filter, so you may end up with a lot of code divergence, which is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have a good control flow, so that the whole warp/block presents similar processing, but that varies a lot across blocks.
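Just so we are talking about the same thing, a warp-level sum with shuffle looks roughly like this (a sketch using the pre-CUDA 9 __shfl_down, to match the __ballot in your code; the function name is mine):

//each lane passes in its partial count; lane 0 ends up with the warp total
__device__ int warp_reduce_sum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;
}

But the divergence point stands: if only a few lanes survive each bin filter, most of the warp just idles through it.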


A feasible solution

I don't remember where, but I have seen very good solutions to your problem around. Did you try this one?

Also, you can use a vectorized load and try something like this, but I'm not sure how much it would improve your performance:

__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {

    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0; // assumes blockDim.x >= N_OF_BINS
    __syncthreads();

    // grid-stride loop over the int4-packed data; each int4 carries 4 bin indices
    for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }

    // process remaining elements if the data size is not a multiple of 4,
    // using a recast and an additional control (a single thread is enough here;
    // init is taken to be an offset in int4 elements)
    if (idx == 0) {
        for (int i = 0; i < rem; i++) {
            atomicAdd(&sBins[reinterpret_cast<int*>(data)[(N + init) * 4 + i]], 1);
        }
    }
    __syncthreads();

    // update your histogram data here: fold the per-block bins into global memory
    if (threadIdx.x < N_OF_BINS) atomicAdd(&count[threadIdx.x], sBins[threadIdx.x]);
}
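If it helps, a hypothetical host-side launch for the sketch above (names and sizes made up, N_OF_BINS assumed to be a compile-time macro no larger than the block size) could be:

#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n_vals = (1 << 20) + 3;      // deliberately not a multiple of 4
    const int N      = n_vals / 4;         // number of full int4 elements
    const int rem    = n_vals % 4;         // tail handled inside the kernel
    const unsigned int init = 0;           // starting offset, in int4 elements

    std::vector<int> h_data(n_vals, 0);    // bin indices in [0, N_OF_BINS)
    int *d_data, *d_count;
    cudaMalloc((void**)&d_data, n_vals * sizeof(int));
    cudaMalloc((void**)&d_count, N_OF_BINS * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n_vals * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, N_OF_BINS * sizeof(int));

    hist<<<64, 256>>>(reinterpret_cast<int4*>(d_data), d_count, N, rem, init);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_count);
    return 0;
}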
Thiago Conrado
  • Vectorized loads don't improve performance on architectures newer than Kepler (back when I tested it), but nice tips on profiling! – kakrafoon Apr 14 '18 at 00:10