
The operation pseudocode for cmpxchg is as follows (Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-M, 2010):

    IF accumulator = DEST
        THEN
            ZF ← 1;
            DEST ← SRC;
        ELSE
            ZF ← 0;
            accumulator ← DEST;
    FI;

At first sight, at least, the accumulator changes its value if (and only if) ZF = 0. So is it safe to ignore ZF entirely and just watch for a change in the accumulator value to judge whether the operation succeeded?

In other words, can I safely use variant #2 instead of #1?

#1:

mov eax, r8d
lock cmpxchg [rdx], ecx
jz @success

#2:

mov eax, r8d
lock cmpxchg [rdx], ecx
cmp eax, r8d
jz @success

I mean, are there any special cases where only ZF can reliably show whether the operation succeeded? It might be a trivial question, but lock-free multitasking is almost impossible to debug, so I have to be 101% sure.

Peter Cordes
Zoltán Bíró

1 Answer

Your reasoning looks correct to me.
Spending extra instructions to regenerate ZF won't cause a correctness problem; it only costs code size if the cmp can fuse with the JCC. It also costs you an extra register, though, to keep the expected value around, vs. only having it in EAX, where cmpxchg replaces it on failure.

This might be why it's ok for GNU C's old-style __sync builtins (obsoleted by __atomic builtins that take a memory-order parameter) to only provide __sync_val_compare_and_swap and __sync_bool_compare_and_swap that return the value or the boolean success result, no single builtin that returns both.
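As a minimal sketch of that equivalence (assuming GCC or Clang; the wrapper names `cas_bool` and `cas_by_value` are made up for illustration), the boolean result can always be recovered from the returned old value, which is what variant #2 does in asm:

```c
#include <stdbool.h>
#include <stdint.h>

/* Like variant #1: the builtin reports success directly
   (on x86 the compiler can take this from ZF after lock cmpxchg). */
static bool cas_bool(int32_t *target, int32_t expected, int32_t desired)
{
    return __sync_bool_compare_and_swap(target, expected, desired);
}

/* Like variant #2: success is derived by comparing the returned old
   value with the expected value, i.e. the extra cmp after cmpxchg. */
static bool cas_by_value(int32_t *target, int32_t expected, int32_t desired)
{
    int32_t old = __sync_val_compare_and_swap(target, expected, desired);
    return old == expected;
}
```

Both wrappers return the same answer for the same inputs, because the old value equals the expected value exactly when the swap happened.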

(The newer __atomic_compare_exchange_n is more like the C11/C++11 API, taking an expected by reference to be updated, and returning a bool. This may allow GCC to not waste instructions with a cmp.)
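For comparison, here is a small sketch of the newer builtin (again assuming GCC or Clang; `cas_update_expected` is a hypothetical wrapper name). It returns the boolean and, on failure, writes the value it actually found into `*expected`, so the caller gets both results from one call:

```c
#include <stdbool.h>
#include <stdint.h>

/* Strong CAS with sequentially consistent ordering.
   Returns true if *target was equal to *expected and has been replaced
   by desired; otherwise returns false and stores the current value of
   *target into *expected. */
static bool cas_update_expected(int32_t *target, int32_t *expected,
                                int32_t desired)
{
    return __atomic_compare_exchange_n(target, expected, desired,
                                       false,            /* weak = false */
                                       __ATOMIC_SEQ_CST, /* on success */
                                       __ATOMIC_SEQ_CST  /* on failure */);
}
```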

Peter Cordes
  • Peter Cordes, the real code looks like the following: `function LockCmpXchg(new{ecx}: int32; var orig{rdx}: int32; ref{r8d}: int32): int32; asm .NOFRAME mov eax, r8d lock cmpxchg [rdx], ecx end;` In this case it's obvious that it would be an unnecessary complication to return a boolean just to indicate that the operation succeeded. – Zoltán Bíró Apr 12 '22 at 08:53
  • @ZoltánBíró: If you're wrapping it up in a non-inline function, then yeah that makes sense. The GNU C builtins are of course designed to inline into functions that can make use of the FLAGS output directly, as well as EAX. – Peter Cordes Apr 12 '22 at 08:59
  • Unfortunately, inline assembler routines are no longer available in Delphi for 64-bit platforms, and assembler intrinsics have never been available in Delphi Pascal at all. So calling ".NOFRAME" procedures is the only way to use fast lock-free code. If I began coding today I would definitely prefer C++, even if Pascal code is much more readable. But it's too late… – Zoltán Bíró Apr 12 '22 at 09:05
  • @ZoltánBíró: Yeah, that's fine, I'm not saying you should be able to find a way to do it differently, just that these differences make different choices sensible. BTW, `__atomic_compare_exchange_n` isn't really an *intrinsic* for an asm instruction - when compiling for ARM it compiles to an LDREX / STREX retry loop, for example. But clearly Delphi Pascal doesn't have direct support for atomics other than rolling your own, or else you'd be using it. I'm a bit surprised it's usable at all, or do you need an asm function even to safely read a variable without having a load hoisted out of a loop? – Peter Cordes Apr 12 '22 at 09:14
  • Exactly. Actually, I simply read the protected variable first (obviously, without LOCK) to help choose between several scenarios. (Note that I use READY-MADE data structures for each scenario to minimize time overhead.) After this, I exchange the old value with the handle/ID of the new structure, **ASSUMING the old value is still the same,** that is, that conditions have not changed. If conditions have changed in the meantime, I have to reconsider the process from the beginning. `LOCK cmpxchg` is just perfect for this purpose and I cannot see another way to do it. – Zoltán Bíró Apr 12 '22 at 09:23
  • @ZoltánBíró: I meant that most use-cases for lock-free code involve some read-only accesses, not all CASes and other RMWs. Read-only access being able to scale to any number of parallel readers (e.g. via a SeqLock for something wider than a register) is something you can't get from a normal readers/writers lock, for cases where you aren't also modifying something in the same cache line. Or just to efficiently read some things between RMWs and stores on other locations. See https://lwn.net/Articles/793253/ for the problems an optimizing compiler can create for pure-loads / pure-stores. – Peter Cordes Apr 12 '22 at 09:59
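
The read-then-CAS pattern described in the comments above can be sketched in GNU C roughly as follows. This is only an illustration under stated assumptions: `choose_next` is a hypothetical stand-in for picking the ready-made structure, and `publish_next` is a made-up name. The initial read uses `__atomic_load_n` so the compiler cannot hoist or duplicate the load, which is the concern raised in the LWN article.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy: decide the new handle/ID from the value just read.
   A real program would pick one of its prepared structures here. */
static int32_t choose_next(int32_t current)
{
    return current + 1;
}

/* Read the shared value, decide on a replacement, and install it only
   if the shared value is still what was read. Returns the installed value. */
static int32_t publish_next(int32_t *shared)
{
    /* Plain atomic read, no LOCK needed on x86. */
    int32_t seen = __atomic_load_n(shared, __ATOMIC_RELAXED);

    for (;;) {
        int32_t next = choose_next(seen);

        if (__atomic_compare_exchange_n(shared, &seen, next,
                                        false,            /* strong CAS */
                                        __ATOMIC_SEQ_CST, /* on success */
                                        __ATOMIC_RELAXED  /* on failure */))
            return next;

        /* CAS failed: seen now holds the value another thread installed,
           so loop and reconsider the choice from the beginning. */
    }
}
```

On failure the builtin has already refreshed `seen`, so no separate re-read is needed before retrying.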