My question is how I can perform an atomic load in CUDA. Atomic exchange can emulate an atomic store. Can an atomic load be emulated inexpensively in a similar manner? I could use an atomic add with 0 to load the contents atomically, but I think that is expensive because it performs an atomic read-modify-write instead of only a read.
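Concretely, here is the kind of emulation I have in mind (just a sketch; the wrapper names are my own, not CUDA built-ins):

```
// Sketch of the emulation described above.
__device__ void myAtomicStore(unsigned int *addr, unsigned int value)
{
    atomicExch(addr, value); // the returned old value is discarded
}

__device__ unsigned int myAtomicLoad(unsigned int *addr)
{
    // Adding 0 leaves the value unchanged and returns it atomically,
    // but it still performs a full read-modify-write.
    return atomicAdd(addr, 0u);
}
```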
-
So you want a blocking load? That sounds like you'll need to roll your own mutex. – talonmies Sep 01 '15 at 21:31
-
More specifically, I want something like the atomic load and store in C++: http://en.cppreference.com/w/cpp/atomic/atomic/load – kirill Sep 01 '15 at 21:41
-
I really don't understand this question. A proper load of a quantity of up to 128 bits per thread is "atomic" in the sense that no part of the load will be modified by "intervening" (loads or) stores. Stores by themselves are likewise guaranteed to be atomic. The purpose of the atomic functions is to provide an uninterrupted RMW facility. – Robert Crovella Sep 10 '15 at 23:47
2 Answers
In addition to using `volatile` as recommended in the other answer, using `__threadfence` appropriately is also required to get an atomic load with safe memory ordering.
While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:
A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, `int *flag = ...; while (*flag) { ... }` may only read `flag` once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The `volatile` modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
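As a minimal illustration (a hypothetical device-side snippet; `flag` and the helper name are mine):

```
// Without the volatile qualifier, the compiler may hoist the load of
// *flag out of the loop and spin forever on a stale register value.
__device__ void waitForFlag(const int *flag)
{
    const volatile int *vflag = flag; // force a memory read on each access
    while (*vflag == 0) {
        // busy-wait; each iteration re-reads flag from memory
    }
}
```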
Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.
For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, and so on. The result is that these operations are not guaranteed to be executed in any particular order, and other threads may not observe these events in the order you expect: the write to the flag is permitted to happen before the guarded data is written.
Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
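In code, the hazardous pattern described in the last two paragraphs looks roughly like this (a deliberately broken sketch; the names are mine):

```
__device__ int payload[256];
__device__ int ready; // flag, initially 0

__device__ void unsafeWriter(int tid, int value)
{
    payload[tid] = value; // (1) non-atomic data write
    ready = 1;            // (2) may become visible to other SMs before (1)
}

__device__ int unsafeReader(int tid)
{
    while (ready == 0) { } // (3) also suffers the register-reuse problem above
    return payload[tid];   // (4) may be speculatively loaded before (3) completes
}
```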
Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. `__threadfence()` and friends are also covered in the CUDA docs.
Putting all of this together, writing your own atomic load method in CUDA looks something like:
```
// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
    const volatile unsigned int *vaddr = addr; // volatile to bypass cache
    __threadfence(); // for seq_cst loads. Remove for acquire semantics.
    const unsigned int value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
    volatile unsigned int *vaddr = addr; // volatile to bypass cache
    // fence to ensure that previous non-atomic stores are visible to other threads
    __threadfence();
    *vaddr = value;
}
```
This can be written similarly for other non-tearing load/store sizes.
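For example, the broken writer/reader pattern from earlier could be repaired with these helpers like so (a sketch under the same assumptions; `payload`, `ready`, and the function names are mine):

```
__device__ int payload;
__device__ unsigned int ready; // initially 0

__device__ void producer(int value)
{
    payload = value;         // plain write to the guarded data
    atomicStore(&ready, 1u); // the fence inside publishes payload first
}

__device__ int consumer()
{
    while (atomicLoad(&ready) == 0) { } // fenced, volatile read of the flag
    return payload; // the trailing fence in atomicLoad orders this read
}
```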
From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics -- but there is currently no way to access them without resorting to inline PTX. They're hoping to add them sometime this year. Once those are in place, a full `std::atomic` implementation shouldn't be far behind.
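If you are targeting sm_70 or newer and are willing to write the inline PTX yourself today, an acquire load might look something like the following (an unverified sketch; check the PTX ISA documentation for the exact syntax supported by your toolchain):

```
// Assumes compute capability 7.0+ and a toolkit whose PTX ISA exposes
// the .acquire/.release qualifiers (PTX ISA 6.0 or later).
__device__ unsigned int loadAcquire(const unsigned int *addr)
{
    unsigned int value;
    asm volatile("ld.acquire.gpu.global.u32 %0, [%1];"
                 : "=r"(value)
                 : "l"(addr)
                 : "memory");
    return value;
}
```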

-
This __threadfence() approach makes way more sense in my head than using "volatile". Good points about "seq_cst" with the first threadfence() in atomicLoad, I didn't think about that particular fence. – Dragontamer5788 Jun 21 '19 at 18:12
To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.
There are two quasi-alternatives, with their advantages and drawbacks:
1. Use a no-op atomic read-modify-write, as you suggest. I have provided a similar answer in the past. Guaranteed atomicity and memory consistency, but you pay the cost of a needless write.
2. Mark the variable `volatile` (see the sketch below). In practice this is the second-closest thing to an atomic load, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (in theory, you may get a torn read), but you are guaranteed to get the most up-to-date value. In practice, though, as indicated in the comments by @Robert Crovella, a torn read is impossible for properly-aligned transactions of at most 32 bytes, which does make them atomic.
Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to 1. The ideal solution would be to add a way to express atomic loads directly in the language.
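A minimal sketch of alternative 2 (the helper name is mine; alternative 1 is just the `atomicAdd(addr, 0u)` shown in the question):

```
// No atomicity guarantee from the language, but avoids the RMW's
// needless write; the volatile access bypasses stale cached values.
__device__ unsigned int volatileLoad(const unsigned int *addr)
{
    return *(const volatile unsigned int *)addr;
}
```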

-
I'm not sure if the `volatile` qualifier can help atomicity of a load. I think [it only enforces the generated PTX load operations to have a `.cv` suffix](http://docs.nvidia.com/cuda/parallel-thread-execution/#cache-operators) and consider the existing value in the cache stale. Will it also prevent the load from being seen as torn by the thread? – Farzad Sep 02 '15 at 02:17
-
@Farzad `volatile` will not help for atomicity at all indeed, hence why the OP should use a no-op RMW if they want that guarantee. There can't be torn writes or reads for anything smaller than or equal to the native word size, so those can't happen for 32-bit types. For 64-bit, yes, it is possible. I do not recommend using `volatile` but the OP said they didn't want to pay for the extra atomic write. I'll edit this in. – user703016 Sep 02 '15 at 02:24
-
A properly aligned load of a 64-bit type cannot be "torn" or partially modified by an "intervening" write. I think this whole question is silly. **All** memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines **only**. There is no other transaction possible. A properly aligned 64-bit type will *always* fall into a single L2 cacheline, and the servicing of that cacheline **cannot** consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write. – Robert Crovella Sep 10 '15 at 23:52
-
@Robert As far as the *language* is concerned, it is allowed to happen (hence my "in theory"). There is currently no way in CUDA to express the intent "atomically load this 64-bit type". – user703016 Sep 11 '15 at 01:01