6

(Note: I've added tags to this question based on where I feel people who are likely to be able to help will be, so please don't shout :))

In my VS 2017 64-bit project, I have a 32-bit `LONG` value `m_lClosed`. When I want to update this, I use one of the `Interlocked` family of functions.

Consider this code, executing on thread #1

LONG lRet = InterlockedCompareExchange(&m_lClosed, 1, 0);   // Set m_lClosed to 1 provided it's currently 0

Now consider this code, executing on thread #2:

if (m_lClosed) // Do something

I understand that on a single CPU this will not be a problem, because the update is atomic and the read is atomic too (see MSDN), so thread pre-emption cannot leave the variable in a partially updated state. But on a multicore CPU, we really could have both these pieces of code executing in parallel, with each thread on a different CPU. In this example I don't think that would be a problem, but it still feels wrong to be testing something that may be in the process of being updated.

This webpage tells me that atomicity on multiple CPUs is achieved via the `LOCK` instruction prefix, which prevents other CPUs from accessing that memory. That sounds like what I need, but the assembly generated for the `if` test above is merely

cmp   dword ptr [l],0  

... no LOCK instruction in sight.

In a situation like this, how are we supposed to ensure atomicity of the read?

EDIT 24/4/18

Firstly thanks for all the interest this question has generated. I show below the actual code; I purposely kept it simple to focus on the atomicity of it all, but clearly it would have been better if I had showed it all from minute one.

Secondly, the project in which the actual code lives is a VS2005 project; hence no access to C++11 atomics. That's why I didn't add the C++11 tag to the question. I am using VS2017 with a "scratch" project to save having to build the huge VS2005 one every time I make a change whilst I am learning. Plus, it's a better IDE.

Right, so the actual code lives in an IOCP driven server, and this whole atomicity is about handling a closed socket:

class CConnection
{
    //...

    DWORD PostWSARecv()
    {
        if (!m_lClosed)
            return ::WSARecv(...);
        else
            return WSAESHUTDOWN;
    }

    bool SetClosed()
    {
        LONG lRet = InterlockedCompareExchange(&m_lClosed, 1, 0);   // Set m_lClosed to 1 provided it's currently 0
        // If the swap was carried out, the return value is the old value of m_lClosed, which should be 0.
        return lRet == 0;
    }

    SOCKET m_sock;
    LONG m_lClosed;
};

The caller will call SetClosed(); if it returns true, it will then call ::closesocket() etc. Please don't question why it is that way, it just is :)

Consider what happens if one thread closes the socket whilst another tries to post a WSARecv(). You might think that the WSARecv() will fail (the socket is closed, after all!); however, what if a new connection is established with the same socket handle as the one we just closed? We would then be posting a WSARecv() that succeeds, but this would be fatal for my program logic, since we would now be associating a completely different connection with this CConnection object. Hence the if (!m_lClosed) test. You could argue that I shouldn't be handling the same connection in multiple threads, but that is not the point of this question :)

That is why I need to test m_lClosed before I make the WSARecv() call.

Now, clearly, I am only ever setting m_lClosed to 1, so a torn read/write is not really a concern; it is the principle I am concerned about. What if I set m_lClosed to 2147483647 and then tested for 2147483647? In that case, a torn read/write would be more problematic.

Stargateur
Wad
  • Maybe you have to look at the problem from a different perspective. Even an atomic compare doesn't prevent the instructions in the resulting branch from executing while the object is modified by a concurrent thread. If that could be your case, you probably need a different approach. Think about semaphores or mutexes. – Frankie_C Apr 23 '18 at 20:20
  • Can you make use of the C11 Atomic types? http://en.cppreference.com/w/c/atomic – Christian Gibbons Apr 23 '18 at 20:21
  • Anyway from C11 or C++11 you can declare the variable `Atomic` and all access will be done atomically by the compiler. – Frankie_C Apr 23 '18 at 20:23
  • your code is simply incorrect; it needs a different approach – RbMm Apr 23 '18 at 20:24
  • 3
    Can't you just do `LONG lRet = InterlockedCompareExchange(&m_lClosed, 1, 1);` in thread 2? – user3386109 Apr 23 '18 at 20:27
  • no lock/atomic helps here; the problem is in the code's logic, and no lock is needed on the read. Just after you read `m_lClosed`, it can change. – RbMm Apr 23 '18 at 20:28
  • Thanks all. I have asked another question about the C++11 atomics because when I 'load' the value there is no locking or `Interlocked` function call - see https://stackoverflow.com/q/49986209/2369597 which also describes what happens with the `if` test above. The concern is not about locking a critical section of code, it is simply about not having torn reads/writes on `m_lClosed` – Wad Apr 23 '18 at 20:34
  • @user3386109 - haha, nice. I hadn't thought of that. I'm go sleep on that and think about it, but I feel that may be the answer to my problem... – Wad Apr 23 '18 at 20:35
  • 1
    The OP is asking about "partially updated state" / torn reads, not the correctness/sanity of the posted code. It would be better if the code was changed to `if (m_lClosed != 666)` or something like that. Or to put it another way, do you need `Interlocked[Compare]Exchange` to read the variable or can you just read it normally with a `MOV` etc. – Anders Apr 23 '18 at 20:35
  • really, neither `if (m_lClosed)` nor `InterlockedCompareExchange(&m_lClosed, 1, 1);` should exist at all; that is a general logic error. There should only be `if (InterlockedCompareExchange(&m_lClosed, 1, 0))`, with the task to do based on that – RbMm Apr 23 '18 at 20:39
  • 1
    If the only two values are `0` and `1`, I have a hard time seeing how this can be "partially updated". There is only a single bit difference, and that bit cannot really be partial. :-) The bigger problem is that after `cmp dword ptr [l],0`, with or without a LOCK, some other CPU might change the value a nanosecond later. – Bo Persson Apr 23 '18 at 20:55
  • @BoPersson A write to something larger than a byte can be torn, another thread might update just the upper 16-bits before you read and the lower 16-bits after/too late. – Anders Apr 23 '18 at 20:59
  • @Anders - But if you only ever write `0` and `1` to the variable, how could that happen? – Bo Persson Apr 23 '18 at 21:10
  • @BoPersson In your specific example, probably not but with something "medium level" like C++ the compiler might be smarter than you and come up with some crazy optimization even if you just wrote `x = 1` in your source code. – Anders Apr 23 '18 at 21:13
  • the question here is really something else - why is *Do something* after `if (m_lClosed)` and not after `if (InterlockedCompareExchange(&m_lClosed, 1, 0) == 0)`? What code runs after the first and the second statement? It looks like an invalid pattern was selected from the beginning; the problem is not the interlocked check but something else - namely how to correctly clean up the object state (always, and exactly once) after close – RbMm Apr 23 '18 at 21:19
  • I prefer to use `InterlockedExchangeAdd()` instead: `if (InterlockedExchangeAdd(&m_lClosed, 0) > 0) // Do something` – Remy Lebeau Apr 24 '18 at 01:13
  • Thanks all. I've edited the question to show what is really going on. – Wad Apr 24 '18 at 13:55
  • The read is atomic but not synchronized. – Raymond Chen Apr 24 '18 at 14:16
  • Thanks for joining Raymond. By synchronized, I take it you mean that if I update `m_lClosed` in one thread, the new value won't necessarily be seen in the other thread that reads it, so it could still read (and act on) the old value? – Wad Apr 24 '18 at 14:18
  • @user3386109, can you post an answer with your comment please because that is the solution I am going to use, otherwise I will post one and accept it as the answer. Thanks. – Wad Apr 25 '18 at 20:40
  • @Wad I'm OK with you posting the answer. You've done a lot more thinking about this than I have :) – user3386109 Apr 25 '18 at 20:48
  • your code is wrong by design anyway - after you read `m_lClosed` and it says false, you begin `WSARecv`, but just after that `m_lClosed` can be set to true in `SetClosed()`. You need [Run-Down Protection](https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/run-down-protection) for the socket state here, instead of any interlocked operation. So you really need another question: how to implement this correctly – RbMm May 01 '18 at 19:27
  • *That is why I need to test m_lClosed before I make the WSARecv() call.* - any test of `m_lClosed` gives you nothing, because the socket can be closed after the test. However, a correct and clean solution exists – RbMm May 01 '18 at 19:34

4 Answers

10

It really depends on your compiler and the CPU you are running on.

x86 CPUs will atomically read 32-bit values without the LOCK prefix if the memory address is properly aligned. However, you most likely will need some sort of memory barrier to control the CPUs out-of-order execution if the variable is used as a lock/count of some other related data. Data that is not aligned might not be read atomically, especially if the value straddles a page boundary.

If you are not hand-coding assembly, you also need to worry about the compiler's reordering optimizations.

Any variable marked as volatile will have ordering constraints in the compiler (and possibly the generated machine code) when compiling with Visual C++:

The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only. With Visual Studio 2003, volatile to volatile references are ordered; the compiler will not re-order volatile variable access. With Visual Studio 2005, the compiler also uses acquire semantics for read operations on volatile variables and release semantics for write operations on volatile variables (when supported by the CPU).

Microsoft-specific `volatile` keyword enhancements:

When the /volatile:ms compiler option is used—by default when architectures other than ARM are targeted—the compiler generates extra code to maintain ordering among references to volatile objects in addition to maintaining ordering to references to other global objects. In particular:

  • A write to a volatile object (also known as volatile write) has Release semantics; that is, a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

  • A read of a volatile object (also known as volatile read) has Acquire semantics; that is, a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

This enables volatile objects to be used for memory locks and releases in multithreaded applications.


For architectures other than ARM, if no /volatile compiler option is specified, the compiler performs as if /volatile:ms were specified; therefore, for architectures other than ARM we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives and compiler intrinsics when you are dealing with memory that is shared across threads.

Microsoft provides compiler intrinsics for most of the Interlocked* functions and they will compile to something like LOCK XADD ... instead of a function call.

Until "recently", C/C++ had no support for atomic operations or threads in general but this changed in C11/C++11 where atomic support was added. Using the <atomic> header and its types/functions/classes moves the alignment and reordering responsibility to the compiler so you don't have to worry about that. You still have to make a choice regarding memory barriers and this determines the machine code generated by the compiler. With relaxed memory order, the load atomic operation will most likely end up as a simple MOV instruction on x86. A stricter memory order can add a fence and possibly the LOCK prefix if the compiler determines that the target platform requires it.

Anders
  • 2
OP is on `VS2017`, so we should at least mention `<atomic>`. Volatile has nothing to do with atomic, except on MSVC. However, even on MSVC, MS itself [strongly recommends](https://msdn.microsoft.com/en-us/library/jj204392.aspx) using `/volatile:iso`, i.e. the standard semantics for `volatile` (no memory barriers). – sbabbi Apr 23 '18 at 22:16
  • Thank you. I feel that in this case, the answer to my problem to ensure an atomic test of `m_lClosed` is that which was posted at https://stackoverflow.com/questions/49989123/atomicity-of-32bit-read-on-multicore-cpu?noredirect=1#comment86995327_49989123. Would you agree/disagree/have any comments? – Wad Apr 25 '18 at 10:40
  • 1
    Reading with a Interlocked* function is fine, the only question is speed. Instructions with the `LOCK` prefix will lock the bus and can slow down other CPU cores if you are accessing your value from many threads at the same time. Using atomics from the C++ standard should be faster on x86. – Anders Apr 25 '18 at 11:08
  • Hi @Anders, sorry but for some reason I didn't get notification of this message. 1) Are you saying that the Interlocked family of functions definitely add `LOCK` (it would make sense for them to). 2) Why are atomics faster on x86 (but x64?) Should I ask a separate question for that? – Wad Apr 25 '18 at 19:29
  • 1
    **1)** Yes, they add `LOCK`. (Remember that these functions real purpose is to write to memory) **2)** C++ atomics are faster because they can leave out the `LOCK` prefix when it is not required (and they are probably inlined instead of a function call). They can probably leave it out on 32 and 64-bit x86 but perhaps not on ARM etc. but the compiler knows so you don't have to worry about it. – Anders Apr 25 '18 at 19:41
  • Thanks. 1) But how can it know to leave `LOCK` out or not? The compiler doesn't know if I will run my binary on a multi core CPU or not, and the `LOCK` is essential in this case. 2) Furthermore, don't the Interlocked family of functions boil down to a single assembly instruction, hence not a function call at all? Most `atomic` functions use the Interlocked family under the hood anyway for updating variables, but not reading them (see my other Q: https://stackoverflow.com/q/49986209/2369597). 3) Does `LOCK` synchronize all CPU caches/prob queues? – Wad Apr 25 '18 at 19:54
  • 1
**1)** It has nothing to do with multi-core or not; it is based on the compiler's knowledge of the CPU regardless of core count, and the compiler probably knows that an aligned 32-bit read is safe. **2)** `Interlocked*` are exported functions in kernel32.dll; the `_Interlocked*` intrinsics are inline. – Anders Apr 25 '18 at 20:10
  • Thanks. I'll most likely post another question about this and post a link to it here; if you would be kind enough to dispense more wisdom it would be greatly appreciated. – Wad Apr 25 '18 at 20:38
4

In C++11, an unsynchronized access to a non-atomic object (such as m_lClosed) is undefined behavior.

The standard provides all the facilities you need to write this correctly; you do not need non-portable functions such as InterlockedCompareExchange. Instead, simply define your variable as atomic:

std::atomic<bool> m_lClosed{false};

// Writer thread...
bool expected = false;
m_lClosed.compare_exchange_strong(expected, true);

// Reader...
if (m_lClosed.load()) { /* ... */ }

This is more than sufficient (it forces sequential consistency, which might be expensive). In some cases it might be possible to generate slightly faster code by relaxing the memory order on the atomic operations, but I would not worry about that.

sbabbi
  • Thanks; I was already aware of C++11 atomics but didn't tag this question with C++11 as I can't use them on the project proper I'm working on. – Wad Apr 24 '18 at 18:45
0

As I posted here, this question was never about protecting a critical section of code; it was purely about avoiding torn reads/writes. user3386109 posted a comment here which I ended up using, but declined to post it as an answer. Thus I am providing the solution I ended up using, for completeness of this question; maybe it will help someone in the future.

The following shows the atomic setting and testing of m_lClosed:

long m_lClosed = 0;

Thread 1

// Set flag to closed
if (InterlockedCompareExchange(&m_lClosed, 1, 0) == 0)
    cout << "Closed OK!\n";

Thread 2

This code replaces `if (!m_lClosed)`:

if (InterlockedCompareExchange(&m_lClosed, 0, 0) == 0)
    cout << "Not closed!";
Wad
  • 1
    This is nonsense. Reading and writing of aligned data is atomic. You cannot suffer tearing. Since `InterlockedCompareExchange` requires alignment of its first argument, it is completely pointless to do this. – David Heffernan May 01 '18 at 14:39
-1

OK, so as it turns out this really isn't necessary; this answer explains in detail why we don't need any interlocked operations for a simple read/write (but we do for a read-modify-write).

Wad