A lot of sources provide implementations of a spin lock in CUDA:
- https://devtalk.nvidia.com/default/topic/1014009/try-to-use-lock-and-unlock-in-cuda/
- Cuda Mutex, why deadlock?
- How to implement Critical Section in cuda?
- Implementing a critical section in CUDA
- https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf.
They all follow the same pattern (sketched below):
1. LOCK: spin until the value of lock is atomically changed from 0 to 1
2. do some critical operations
3. UNLOCK: release the lock by setting its value back to 0
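For concreteness, here is a minimal sketch of that pattern as I understand it from the sources above (the global lock word, the trivial data increment, and the kernel name are my own illustration, not code taken from any of them):

    __device__ int lock = 0;   // 0 = free, 1 = taken
    __device__ int data = 0;   // state protected by the lock

    __global__ void increment_with_lock()
    {
        // Step 1 (LOCK): spin until this thread changes lock from 0 to 1.
        while (atomicCAS(&lock, 0, 1) != 0) {}

        // Step 2: critical section.
        data = data + 1;

        // Step 3 (UNLOCK): set the lock back to 0.
        atomicExch(&lock, 0);
    }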
Let's assume for now that we don't have warp divergence, or, in other words, that we don't use locks for intra-warp synchronization (no two threads of the same warp contend for the same lock).
What is the right way to implement step 1?
Some answers propose to use atomicCAS:

    while (0 != (atomicCAS(&lock, 0, 1))) {}

while others propose atomicExch:

    while (atomicExch(&lock, 1) != 0) {}

Are the two equivalent?
What is the right way to implement step 3?
Almost all sources propose to use atomicExch for that:

    atomicExch(&lock, 0);
One user proposed an alternative (in Implementing a critical section in CUDA) that also makes sense, but it didn't work for him (so it probably leads to undefined behavior in CUDA):

    lock = 0;
It seems that for a general spin lock on a CPU it is valid to do that: https://stackoverflow.com/a/7007893/8044236. Why can't we use it in CUDA?
Do we have to use a memory fence and the volatile specifier for memory accesses in step 2?
CUDA docs about atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions) say that they don't guarantee ordering constraints:
Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations
Does this mean that we have to use a memory fence at the end of the critical section (step 2) to ensure that the changes made inside it become visible to other threads before the unlock (step 3)?
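To make the question concrete, this is what I imagine such a fenced version would look like, reusing lock and data from the sketch above (the placement of __threadfence() and the volatile access are exactly the parts I am unsure about):

    __global__ void increment_with_lock_fenced()
    {
        while (atomicCAS(&lock, 0, 1) != 0) {}   // step 1: LOCK

        // Step 2: access the shared state through a volatile pointer so the
        // compiler does not cache it in a register across the lock/unlock.
        volatile int *d = &data;
        *d = *d + 1;

        // Flush the write device-wide before the lock is released...
        __threadfence();
        // ...so that the next owner of the lock sees the updated data.
        atomicExch(&lock, 0);                    // step 3: UNLOCK
    }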
Does CUDA guarantee that other threads will ever see the changes made by a thread with the atomic operations in steps (1) and (3)?
This is not true for memory fences (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions):
Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)).
So probably it is also not guaranteed for atomic operations? If so, all spinlock implementations in CUDA rely on undefined behavior.
How can we implement a reliable spinlock in the presence of warps?
Now, provided that we have answers to all the questions above, let's drop the assumption that there is no warp divergence. Is it still possible to implement a spinlock in that case?
The main issue (a deadlock) is illustrated on slide 30 of https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf.
Is the only option to replace the while loop with an if in step (1) and to enclose all three steps in a single while loop, as proposed, for example, in Thread/warp local lock in cuda or CUDA, mutex and atomicCAS()?
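For reference, my reading of those answers is roughly the following sketch (again reusing lock and data from the first sketch; this shows the proposed structure, it is not verified code):

    __global__ void increment_with_lock_divergent()
    {
        bool done = false;
        while (!done) {
            // Only the thread that wins the CAS enters the critical section;
            // the losers fall through and retry on the next iteration, so the
            // warp as a whole keeps executing instead of spinning forever on
            // a lock held by one of its own threads.
            if (atomicCAS(&lock, 0, 1) == 0) {
                data = data + 1;          // step 2
                __threadfence();
                atomicExch(&lock, 0);     // step 3
                done = true;
            }
        }
    }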