CUDA: atomicAdd takes too much time, serializing threads

Question

I have a kernel which makes some comparisons and decides whether two objects collide or not. I want to store the colliding objects' id's to an output buffer. I do not want to have gap in the output buffer. I want to record each collision to a unique index in the output buffer.

So I created an atomic variable in the shared memory (local sum), and also in global memory (global sum). The code below shows the incrementing of the shared variable as the collision is found. I do not have problem with incrementing atomic variable at global memory for now.

__global__ void mykernel(..., unsigned int *gColCnt) {
    ...

    __shared__ unsigned int sColCnt;
    __shared__ unsigned int sIndex;

    if (threadIdx.x == 0) {
        sColCnt = 0;
    }
    
    __syncthreads();
    
    unsigned int index = 0;
    if (colliding)
        index = atomicAdd(&sColCnt, 1); //!!Time Consuming!!

    __syncthreads();

    if (threadIdx.x == 0)
        sIndex = atomicAdd(gColCnt, sColCnt);
    
    __syncthreads();
    
    if (sColCnt + sIndex > outputSize) { //output buffer is not enough
        //printf("Exceeds outputsize: %d + %d > %d\n", sColCnt, sIndex, outputSize);
        return;
    }
    
    if (colliding) {
        output[sIndex + index] = make_uint2(startId, toId);
    }
}

My problem is that, when many threads try to increment the atomic variable, they get serialized. Before writing something like prefix-sum, I wanted to ask if there is a way of getting this done efficiently.

The elapsed time of my kernel increases from 13msec to 44msec because of this one line out there.

I found a prefix-sum example code but its referenced links fails because of NVIDIA's discussing board is down. https://stackoverflow.com/a/3836944/596547

Edit: I have added the end of my code too to above. In fact I do have an hierarchy. To see the affect of every code line, I setup scenes where every object collides with each other, extreme case, and another extreme case where approximately no objects collide.

At the end I add the shared atomic variable to a global variable (gColCnt) to inform outside about the number of collisions and find correct index values. I think I have to use atomicAdd here in any way.

`atomicAdd` serializes by definition, so you should only rely on it when you predict that collisions will be sparse. Perhaps you could restructure your computation to use atomics hierarchically: first, accumulate into a `__shared__` variable in each thread block. In a postprocess (e.g., after the 3rd `__syncthreads` of your kernel above), you could accumulate each block's collisions into a single variable in global memory. — Jared Hoberock, Jul 22 '12 at 06:17
In fact I do have an hierarchy. But threads in the same block serializes on atomicAdd for __shared__ variable too, at least for the first extreme case, where every object collides with each other. — phoad, Jul 22 '12 at 11:46
www.cuvilib.com/Reduction.pdf I have found a tutorial of M. Harris. I will try to utilize it. — phoad, Jul 22 '12 at 12:17
Actually, the atomic performance depends on your hardware. What hardware are you running on? — harrism, Jul 23 '12 at 01:45
I have two Gtx460. I wrote the reduction code, but it seemed I need prefix-sum (hardly wrote it, need some fixes, bank conflicts etc). It took half of the elapsed time for the fully colliding example. But first saving the collisions then compaction with thrust::remove_if also gave good results. I am a bit stuck though. — phoad, Jul 23 '12 at 07:12

score 1 · Accepted Answer · answered Jul 22 '12 at 14:24

1

Consider using a parallel stream compaction algorithm, for instance thrust::copy_if.

answered Jul 22 '12 at 14:24

Roger Dahl

15,132
8
62
82

I think I had already thought why I could not do call thrust::copy_if, but I could not figure out the reason now. I will try the small prefix-sum in the kernel and then re-think and try this and inform. Thank you. – phoad Jul 22 '12 at 14:49
Yes, I do not know how many collisions will be found. So the output buffer size is unknown (may be too huge). I do an initial output buffer size estimation, and copy the collisions to the buffer as much as it can take. If more buffer is needed, I scale the buffer up and call the kernel again. Is it a good approach? – phoad Jul 22 '12 at 15:58
1

Looks like a single thread can find either zero or one collisions? Then you can try allocating one slot per thread and write to that slot if a collision is found. Then see how long gathering up the results with the stream compaction algorithm takes compared to the kernel that finds the collisions and then see if it's worth it to pursue a more advanced solution. – Roger Dahl Jul 22 '12 at 17:13
Number of collision checks is quite high, because I want to make a batch collision check. Namely, like 1000.000 checks for 500 objects. I was scared of insufficient memory, if I advocate a slot for every possible result. Decreasing the batch size may be a solution though. – phoad Jul 22 '12 at 17:21

CUDA: atomicAdd takes too much time, serializing threads

1 Answers1