What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Question

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know whether the writes are coalesced or not.

In code:

array[threadIdx.x] = value;
__threadfence();

Or

atomicExch(&array[threadIdx.x] , value);

Thanks.

Could you try both and report the findings? :) Or try both, be surprised, and then ask "But why is ABC so much faster than XYZ?" Or try both, see only bland results, and move on. - Jul 20 '12 at 20:17

score 2 · Answer 1 · answered Sep 12 '12 at 06:51


On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.

Please run an experiment and report the results.
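One way such an experiment could be set up is sketched below. It times both variants with CUDA events on a collision-free array. The kernel names (`storeFence`, `storeAtomic`), the array size, the launch configuration and the repetition count are all assumptions made for illustration, and a global index is used instead of the bare `threadIdx.x` from the question so that more than one block can run. The numbers it prints depend on the GPU architecture, so treat it as a starting point rather than a definitive benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant A: plain store followed by a device-wide memory fence.
__global__ void storeFence(int *array, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        array[i] = value;
        __threadfence();
    }
}

// Variant B: atomic exchange; every thread targets its own address.
__global__ void storeAtomic(int *array, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicExch(&array[i], value);
    }
}

int main() {
    const int n = 1 << 24;        // hypothetical problem size
    const int threads = 256;      // hypothetical block size
    const int blocks = (n + threads - 1) / threads;
    const int reps = 20;          // launches averaged per variant

    int *d_array = nullptr;
    cudaMalloc(&d_array, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Warm-up launches so one-time startup costs are not measured.
    storeFence<<<blocks, threads>>>(d_array, 0, n);
    storeAtomic<<<blocks, threads>>>(d_array, 0, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        storeFence<<<blocks, threads>>>(d_array, r, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("store + __threadfence(): %.3f ms per launch\n", ms / reps);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        storeAtomic<<<blocks, threads>>>(d_array, r, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("atomicExch():            %.3f ms per launch\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_array);
    return 0;
}
```

Compiling with something like `nvcc -O2` and running it on both a Fermi and a Kepler card would show whether the guess above holds.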


harrism


score 0 · Answer 2 · answered Aug 11 '16 at 15:00

Those two do very different things.

atomicExch ensures that no two threads modify a given cell at the same time. If such a conflict occurs, the conflicting accesses are serialized, so one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic... function.
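To illustrate where the atomic does matter, here is a minimal sketch in which many threads target the same cell. The kernel name `claim`, the variables `slot` and `winners`, and the launch configuration are hypothetical. Each thread swaps 1 into the cell and looks at the old value it got back; because the exchange is atomic, only one thread should ever see the initial 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread tries to claim the same cell. atomicExch returns the value
// that was stored before this thread's exchange took effect.
__global__ void claim(int *slot, int *winners) {
    int old = atomicExch(slot, 1);
    if (old == 0) {
        atomicAdd(winners, 1);   // count the threads that saw the initial 0
    }
}

int main() {
    int *d_slot = nullptr;
    int *d_winners = nullptr;
    cudaMalloc(&d_slot, sizeof(int));
    cudaMalloc(&d_winners, sizeof(int));
    cudaMemset(d_slot, 0, sizeof(int));
    cudaMemset(d_winners, 0, sizeof(int));

    claim<<<64, 256>>>(d_slot, d_winners);

    int winners = 0;
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads that observed the initial value: %d\n", winners);

    cudaFree(d_slot);
    cudaFree(d_winners);
    return 0;
}
```

With a plain read followed by a plain write instead of atomicExch, several threads could read 0 before any of them writes 1. In the question's code, where each thread writes its own element, that situation cannot arise.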

__threadfence() is a memory fence for the calling thread (and only the calling thread!): writes that thread made to global and shared memory before the fence are observed by the other threads on the device as happening before any writes it makes after the fence. As such, __threadfence() on its own, without any follow-up code, is not very interesting.
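A sketch of the kind of follow-up code that makes the fence meaningful is shown below. It is loosely modeled on the well-known "last block finishes the sum" pattern. The names (`sumBlocks`, `partial`, `count`), the sizes, and the one-thread-per-block launch are simplifying assumptions, not something taken from the question. Each block publishes a partial result, fences, and only then takes a ticket from a counter, so the block that draws the last ticket can rely on all the partial results being visible.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int count = 0;   // number of blocks that have published

// Launched with one thread per block to keep the sketch short.
__global__ void sumBlocks(const float *in, volatile float *partial,
                          float *total, int perBlock) {
    float s = 0.0f;
    for (int i = 0; i < perBlock; ++i)
        s += in[blockIdx.x * perBlock + i];

    partial[blockIdx.x] = s;   // 1. publish this block's partial sum
    __threadfence();           // 2. order that write before the ticket below
    unsigned int ticket = atomicInc(&count, gridDim.x);   // 3. signal

    // The block that draws the last ticket knows every other block has
    // already passed its fence, so all partial sums are visible to it.
    if (ticket == gridDim.x - 1) {
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            t += partial[b];
        *total = t;
        count = 0;             // reset so the kernel can be launched again
    }
}

int main() {
    const int blocks = 64;     // hypothetical sizes
    const int perBlock = 128;
    const int n = blocks * perBlock;

    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumBlocks<<<blocks, 1>>>(d_in, d_partial, d_total, perBlock);

    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("total = %.1f (expected %d)\n", total, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return 0;
}
```

Without the fence, the counter update could become visible before the partial sum, and the last block might read a stale value. Note that the fence is paired with an atomic here; the two cooperate rather than compete.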

For that reason, I don't think there is a point in comparing the efficiency of those two. Maybe if you could show a more concrete use case I could relate...

Note that neither of those actually gives you any guarantees about the order in which the threads execute.
