I am looking for an optimisation strategy for my CUDA program. At each iteration of the for loop in my kernel, each thread produces a score. I maintain a shared priority queue of scores to keep the top-k per block. Please see the pseudo-code below:
__global__ void gpuCompute(... arguments)
{
    __shared__ int myPriorityQueue[k];                  // top-k scores (k < #threads in this block)
    __shared__ int scoresToInsertInQueue[#threadsInBlock];
    __shared__ int counter;                             // reset to 0 before each iteration

    for (...)                                           // about 1-100 million iterations
    {
        int score = calculate_score(...);
        if (score > minimum element of myPriorityQueue && ...)
        {
            int localCounter = atomicAdd(&counter, 1);  // ATOMIC: reserve a slot in the buffer
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();

        // Merge scores from scoresToInsertInQueue into myPriorityQueue
        while (counter > 0)
        {
            // parallel insertion of scoresToInsertInQueue[counter-1] into
            // myPriorityQueue using the participation of k threads in this block
            counter--;                                  // decremented by a single thread
            __syncthreads();
        }
        __syncthreads();
    }
}
Hoping the above code makes sense to you guys. Now, I am looking for a way to remove the atomic-operation overhead, such that each thread instead writes a '1' or '0' depending on whether its value should go into the priority queue. I am wondering whether there is any implementation of stream compaction within a kernel, so that I can reduce a flag buffer like '1000000000100000000' to '1100000000000000000' (or just learn the indices of the '1's) and finally insert the scores corresponding to the '1's into the queue.
Note that the '1's will be very sparse in this case.
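For context, the closest thing I can sketch myself is warp-level compaction with `__ballot_sync`/`__popc`, which collapses the per-thread atomic down to one `atomicAdd` per warp rather than eliminating it entirely. This is an untested sketch under my own naming (`push_candidates`, `keep`, etc. are all made up); `keep` is the per-thread 0/1 flag:

```cuda
// Untested sketch: warp-aggregated insertion of sparse candidates.
// Assumes a 32-thread warp and that every thread of the warp calls this.
__device__ void push_candidates(bool keep, int score,
                                int* scoresToInsertInQueue, int* counter)
{
    const unsigned FULL = 0xFFFFFFFFu;
    unsigned ballot = __ballot_sync(FULL, keep);       // bit i == flag of lane i
    if (ballot == 0) return;                           // warp-uniform: whole warp exits

    int lane   = threadIdx.x & 31;
    int rank   = __popc(ballot & ((1u << lane) - 1));  // my index among this warp's '1's
    int leader = __ffs(ballot) - 1;                    // lowest set lane reserves the slots

    int base = 0;
    if (lane == leader)
        base = atomicAdd(counter, __popc(ballot));     // one atomic per warp, not per thread
    base = __shfl_sync(FULL, base, leader);            // broadcast the reserved base index

    if (keep)
        scoresToInsertInQueue[base + rank] = score;
}
```

Since the '1's are very sparse, most warps would take the early `ballot == 0` exit and never touch the atomic at all. I am not sure whether going fully atomic-free (e.g. a block-wide exclusive prefix sum over the flags) would actually beat this in practice, which is part of what I am asking.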