I am looking for an optimisation strategy for my CUDA program. At each iteration of the for loop in my kernel, each thread produces a score. I maintain a shared priority queue of scores to keep the top-k per block. Please see the pseudo-code below:
__global__ void gpuCompute(... arguments)
{
    __shared__ int myPriorityQueue[k];                  // top-k scores (k < #threads in this block)
    __shared__ int scoresToInsertInQueue[#threadsInBlock];
    __shared__ int counter;                             // reset to 0 before each iteration

    for (...)                                           // about 1-100 million iterations
    {
        int score = calculate_score(...);
        if (score > minimum element of myPriorityQueue && ...)
        {
            int localCounter = atomicAdd(&counter, 1);  // ATOMIC: reserve a slot in the buffer
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();

        // Merge scores from scoresToInsertInQueue into myPriorityQueue
        while (counter > 0)
        {
            // parallel insertion of scoresToInsertInQueue[counter-1] into
            // myPriorityQueue using the participation of k threads in this block
            counter--;                                  // decremented by a single thread
            __syncthreads();
        }
        __syncthreads();
    }
}
Hoping the above code makes sense to you guys. Now, I am looking for a way to remove the atomic-operation overhead, such that each thread instead writes a '1' or '0' depending on whether its value should go into the priority queue. I am wondering whether there is any implementation of stream compaction within a kernel, so that I can reduce a flag buffer like '1000000000100000000' to '1100000000000000000' (or just learn the indices of the '1's) and finally insert the scores corresponding to the '1's into the queue.
Note that the '1's will be very sparse in this case.
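For context, the closest thing I can sketch myself is warp-level compaction with `__ballot_sync`/`__popc`, which collapses the per-thread atomic down to one `atomicAdd` per warp rather than eliminating it entirely. This is an untested sketch under my own naming (`push_candidates`, `keep`, etc. are all made up); `keep` is the per-thread 0/1 flag:

```cuda
// Untested sketch: warp-aggregated insertion of sparse candidates.
// Assumes a 32-thread warp and that every thread of the warp calls this.
__device__ void push_candidates(bool keep, int score,
                                int* scoresToInsertInQueue, int* counter)
{
    const unsigned FULL = 0xFFFFFFFFu;
    unsigned ballot = __ballot_sync(FULL, keep);       // bit i == flag of lane i
    if (ballot == 0) return;                           // warp-uniform: whole warp exits

    int lane   = threadIdx.x & 31;
    int rank   = __popc(ballot & ((1u << lane) - 1));  // my index among this warp's '1's
    int leader = __ffs(ballot) - 1;                    // lowest set lane reserves the slots

    int base = 0;
    if (lane == leader)
        base = atomicAdd(counter, __popc(ballot));     // one atomic per warp, not per thread
    base = __shfl_sync(FULL, base, leader);            // broadcast the reserved base index

    if (keep)
        scoresToInsertInQueue[base + rank] = score;
}
```

Since the '1's are very sparse, most warps would take the early `ballot == 0` exit and never touch the atomic at all. I am not sure whether going fully atomic-free (e.g. a block-wide exclusive prefix sum over the flags) would actually beat this in practice, which is part of what I am asking.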