In theory you want atom_cmpxchg for correctness here (or the equivalent in whichever GPGPU API you use). However, a serious warning: having the entire machine serialize through a single cacheline will strangle your performance. Atomics on the same address must queue up and wait for one another, whereas atomics on different locations can proceed in parallel (more details at the end).
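For illustration, here is a minimal sketch of the compare-and-swap retry loop, written in host-side C11 rather than OpenCL C so that it compiles anywhere. The LCG constants and the names lcg_step and shared_next are assumptions made for this example, not part of the question. In a kernel the same loop would be built on atom_cmpxchg (atomic_cmpxchg in newer OpenCL versions), which returns the old value instead of a success flag.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical generator step: a 32-bit LCG, used purely as an example. */
static uint32_t lcg_step(uint32_t s)
{
    return s * 1664525u + 1013904223u;
}

/* Atomically advance one shared state and return the value this caller won.
 * Every caller hits the same address, so callers serialize on it. */
uint32_t shared_next(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);
    uint32_t next;
    do {
        next = lcg_step(old);
        /* On failure, 'old' is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(state, &old, next));
    return next;
}
```

This is correct, but it is exactly the single-address pattern the warning is about: every work item retries against the same word.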
Generally, GPGPU algorithms that use random numbers give each work item its own copy of the random number generator state. This lets each work item cache and reuse its own state without glutting the bus with memory traffic for every new random number. Search for "OpenCL Monte Carlo" together with "Simulation" or "Example" for samples. CUDA has some nice examples too.
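A rough sketch of the per-work-item approach, again in plain C with the work-item id passed in as a parameter (in a kernel it would come from get_global_id). The xorshift32 generator, the seed-mixing function and all names here are assumptions chosen for the example. Any small generator whose state fits in registers would do.

```c
#include <stdint.h>

/* Marsaglia xorshift32 step; the state must never be zero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical seeding: mix the work-item id into the base seed so that
 * neighbouring ids start from unrelated states. */
uint32_t seed_for_item(uint32_t base_seed, uint32_t gid)
{
    uint32_t h = base_seed ^ (gid * 2654435761u);
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h ? h : 1u; /* avoid the all-zero state */
}

/* Stand-in for a kernel body: count random points inside the unit circle.
 * The generator state lives in a local variable and never touches shared memory. */
uint32_t count_hits(uint32_t base_seed, uint32_t gid, uint32_t samples)
{
    uint32_t state = seed_for_item(base_seed, gid); /* private copy */
    uint32_t hits = 0;
    for (uint32_t i = 0; i < samples; ++i) {
        float x = (float)(xorshift32(&state) >> 8) * (1.0f / 16777216.0f);
        float y = (float)(xorshift32(&state) >> 8) * (1.0f / 16777216.0f);
        if (x * x + y * y < 1.0f)
            ++hits;
    }
    return hits;
}
```

Each work item would write only its final count to global memory, so the per-sample traffic stays in registers.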
Another option is to use a random number generator that can skip ahead, and have each work item advance a different distance into the sequence. This can be more compute-intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
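As one example of a generator that can skip ahead, a linear congruential generator can jump n steps in O(log n) by repeatedly squaring its affine update. The sketch below assumes a 32-bit LCG with example constants. The names lcg_next and lcg_skip are made up for illustration.

```c
#include <stdint.h>

#define LCG_A 1664525u
#define LCG_C 1013904223u

/* One ordinary step: s -> A*s + C (mod 2^32). */
uint32_t lcg_next(uint32_t s)
{
    return LCG_A * s + LCG_C;
}

/* Jump n steps at once by composing the map x -> a*x + c with itself. */
uint32_t lcg_skip(uint32_t s, uint64_t n)
{
    uint32_t a = LCG_A, c = LCG_C;   /* map for the current power of two */
    uint32_t acc_a = 1u, acc_c = 0u; /* accumulated map, starts as identity */
    while (n) {
        if (n & 1u) {
            /* apply the current map on top of the accumulated one */
            acc_a = acc_a * a;
            acc_c = acc_c * a + c;
        }
        /* square the current map: a*(a*x + c) + c = a^2*x + (a + 1)*c */
        c = (a + 1u) * c;
        a = a * a;
        n >>= 1;
    }
    return acc_a * s + acc_c;
}
```

A work item with id gid could then start from lcg_skip(seed, (uint64_t)gid * stride), where stride is at least the number of values each work item consumes, so that the subsequences do not overlap.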
More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays, and the atomic ALUs are per bank; (2) each dword in a cacheline is processed by the same atomic ALU every time; and (3) neighboring cachelines hash to different banks. So if on every clock you issue atomics to contiguous cachelines of data, the work should be spread out perfectly (or at least statistically so). Conversely, if every work item atomically modifies the same 32-bit word, the cache system cannot service all 16/32/64 lanes (whatever width your system uses) in a single pass through that atomic ALU slot. It must break the operation up into 16/32/64 separate atomic operations and apply them one after another (by #2 above). In a system with 512 ALUs for processing atomics, you would be using just one of them each clock (the same one every time). Spread the work out and you can use all 512 per clock.
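The addressing pattern can be sketched as follows, assuming a simple shared counter as the workload. The slot count and the function names are hypothetical. This host-side C11 code only shows where the atomics land; it says nothing about actual GPU timings.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SLOTS 64u /* hypothetical; pick to suit the hardware */

/* Contended: every work item increments the same 32-bit word. */
void count_contended(_Atomic uint32_t *counter)
{
    atomic_fetch_add(counter, 1u);
}

/* Spread: neighbouring work items increment neighbouring words of a
 * contiguous array, so the updates land on different locations. */
void count_spread(_Atomic uint32_t slots[NUM_SLOTS], uint32_t gid)
{
    atomic_fetch_add(&slots[gid % NUM_SLOTS], 1u);
}

/* A final pass sums the partial counts once all work items are done. */
uint32_t reduce_slots(_Atomic uint32_t slots[NUM_SLOTS])
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < NUM_SLOTS; ++i)
        total += atomic_load(&slots[i]);
    return total;
}
```

Both variants produce the same total. The difference is that the spread version gives the hardware many distinct addresses to work on instead of one.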