
On MSVC for x64 (19.10.25019),

    InterlockedOr(&g, 1) 

generates this code sequence:

    prefetchw BYTE PTR ?g@@3JC
    mov     eax, DWORD PTR ?g@@3JC                  ; g
    npad    3
$LL3@f:
    mov     ecx, eax
    or      ecx, 1
    lock cmpxchg DWORD PTR ?g@@3JC, ecx             ; g
    jne     SHORT $LL3@f

I would have expected the much simpler (and loopless):

    mov      eax, 1
    lock or  [?g@@3JC], eax

InterlockedAnd generates analogous code to InterlockedOr.

It seems wildly inefficient to need a loop for this operation. Why is this code generated?

(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)
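(To illustrate the compare-exchange trick for a pure load: you pass the same value as both comparand and exchange, so the memory is never actually modified. Below is a portable sketch of the idea using `std::atomic::compare_exchange_strong` as a stand-in for the Windows call `InterlockedCompareExchange(&x, 0, 0)`; the helper name is mine, not a Windows API.)

```cpp
#include <atomic>

// Sketch: atomically load `x` using only a compare-exchange primitive.
// If x != 0 the CAS fails but still reports the current value into
// `expected`; if x == 0 it "succeeds" by harmlessly writing 0 over 0.
long load_via_cas(std::atomic<long>& x)
{
    long expected = 0;
    x.compare_exchange_strong(expected, 0);
    return expected;   // holds the observed value either way
}
```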

  • BTW, if there is an atomic-load function, it probably wouldn't be called "interlocked" because that name doesn't really apply. It's a one-way operation, not RMW, so no locking is required. (And x86 has no way to do it for pointers that cross an 8B or cache-line boundary except with `lock cmpxchg` (which fails on read-only memory) [because `lock mov` doesn't exist](https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic-on-x86/36685056#36685056).) – Peter Cordes Oct 30 '17 at 18:52

1 Answer


The documented contract for InterlockedOr has it returning the original value:

InterlockedOr

Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.

LONG __cdecl InterlockedOr(
  _Inout_ LONG volatile *Destination,
  _In_    LONG          Value
);

Parameters:

Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.

Value [in]
The second operand.

Return value

The function returns the original value of the Destination parameter.

This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
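That general pattern can be sketched in portable C++. This is not MSVC's actual implementation, just the shape of the loop the compiler emits:

```cpp
#include <atomic>

// Generic retry loop: a fetch-or built from compare-exchange.
// compare_exchange_weak reloads `old` with the freshly observed value on
// failure, so the loop simply retries until no other thread modified the
// variable between our load and our CAS.
long fetch_or_via_cas(std::atomic<long>& dest, long value)
{
    long old = dest.load();
    while (!dest.compare_exchange_weak(old, old | value)) {
        // `old` was updated with the current value; try again
    }
    return old;   // the value before the OR, as the contract requires
}
```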

As you observed, InterlockedAnd behaves the same way, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code generator has to fall back on the general pattern built around compare-and-exchange, which is directly supported by the hardware.

Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.

#include <intrin.h>
#include <Windows.h>   // for the LONG typedef

LONG InterlockedOrWithReturn()
{
    LONG val = 42;
    return _InterlockedOr(&val, 8);   // return value is used
}

void InterlockedOrWithoutReturn()
{
    LONG val = 42;
    _InterlockedOr(&val, 8);          // return value is ignored
}

With optimizations enabled, MSVC compiles these two functions to:
InterlockedOrWithoutReturn, COMDAT PROC
        mov      DWORD PTR [rsp+8], 42
        lock or  DWORD PTR [rsp+8], 8
        ret      0
InterlockedOrWithoutReturn ENDP

InterlockedOrWithReturn, COMDAT PROC
        mov          DWORD PTR [rsp+8], 42
        prefetchw    BYTE PTR [rsp+8]
        mov          eax, DWORD PTR [rsp+8]
LoopTop:
        mov          ecx, eax
        or           ecx, 8
        lock cmpxchg DWORD PTR [rsp+8], ecx
        jne          SHORT LoopTop
        ret          0
InterlockedOrWithReturn ENDP

The optimizer is equally smart about InterlockedAnd, and should be for the other Interlocked* functions as well.

As intuition would tell you, the LOCK OR implementation is more efficient than LOCK CMPXCHG in a loop. On top of the larger code size and the overhead of looping, you risk branch mispredictions, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value of interlocked operations, you can gain a performance boost.

However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.
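With std::atomic, fetch_or gives you the same contract (atomic OR returning the old value), and a plain load() is the idiomatic atomic read. The same codegen distinction applies: mainstream x86 compilers can emit a simple LOCK OR when the result is discarded. A minimal sketch:

```cpp
#include <atomic>

std::atomic<long> g{0x10};

long with_return()
{
    return g.fetch_or(1);   // old value needed: expect a CMPXCHG loop on x86
}

void without_return()
{
    g.fetch_or(1);          // result discarded: a plain LOCK OR suffices
}
```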

Cody Gray - on strike
  • I appreciate your explanation, thank you! I'm not sure that `std::atomic` does the trick. For instance, `std::atomic::load` compiles to a simple mov instruction. That doesn't guarantee that I'm seeing the latest value of the variable, right? – R. Alabaster Oct 30 '17 at 16:42
  • 2
    It actually does make that guarantee, @R.Alabaster. On x86, the `MOV` instruction is sufficient for a sequentially consistent atomic load, so long as the value you're reading from memory is aligned. That's almost always going to be the case, and the compiler knows it is in your test case, so it emits the simplest code that it can. – Cody Gray - on strike Oct 30 '17 at 16:46
  • Worth mentioning that x86 has an `xadd` instruction, and `lock xadd` implements fetch_add / InterlockedAdd so in that one case it doesn't need a cmpxchg loop to return the old value. xadd with 0 would be a really terrible way to implement an atomic load though, unless you really need to support the case where the shared variable is misaligned. (And even then, you'd probably want to branch at runtime to see if it is aligned, if that's at all likely.) – Peter Cordes Oct 30 '17 at 18:56
  • @R.Alabaster: You have to be careful what you mean by "latest". A `lock xadd` is a barrier for earlier stores, so its load can't take a value from L1D cache until all of this thread's stores are already globally visible. A simple seq_cst load is just an acquire load, and doesn't prevent StoreLoad reordering, so it can take its value earlier. The value it gets is always the current value that all caches agree is the current value of that memory location, though, because they are coherent and there is a total store order (on x86). By using a slower load, you could get a value later... – Peter Cordes Oct 31 '17 at 06:36
  • @R.Alabaster: Normally what you'd do if you needed a barrier against reordering with earlier stores is to make the last store before this load a seq_cst atomic store. (i.e. an `xchg`, or store + mfence). You often don't need this, but it is the default for `std::atomic<>` (e.g. `x = 42`) because it's easier to reason about. If you want better performance, you have to realize that progress in separate threads is asynchronous, and you *want* them to be overlapping their work. **`lock xadd` doesn't make the data arrive faster, it just makes your thread wait for it until after your stores.** – Peter Cordes Oct 31 '17 at 06:43
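The pattern described in the last comment can be sketched as follows: put the expensive seq_cst barrier on the writer's final store, and let readers use cheap loads (a plain MOV on x86). The function names are illustrative, not from the original discussion:

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void writer()
{
    data.store(42, std::memory_order_relaxed);
    flag.store(1);   // seq_cst store: XCHG (or store + MFENCE) on x86
}

int reader()
{
    if (flag.load() == 1)                        // seq_cst load: a plain MOV on x86
        return data.load(std::memory_order_relaxed);
    return -1;                                   // flag not yet set
}
```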