Can I base a mission-critical application on the result of this test: that 100 threads reading a pointer set a billion times by a main thread never see a tear?

Are there any other potential problems with doing this, besides tearing?

Here's a stand-alone demo that compiles with `g++ -g tear.cxx -o tear -pthread`.

#include <atomic>
#include <cstdint>   // intptr_t
#include <cstdio>    // printf
#include <thread>
#include <vector>

using namespace std;

void* pvTearTest;
atomic<int> iTears( 0 );

void TearTest( void ) {

  while (1) {
      void* pv = (void*) pvTearTest;

      intptr_t i = (intptr_t) pv;

      if ( ( i >> 32 ) != ( i & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %p\n", pv );
          iTears++;
      }
      if ( ( i >> 32 ) == 999999999 )
          break;

  }
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );

  for ( auto& thr: athr )
      thr.join();

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

The actual application is a malloc()'ed, and sometimes realloc()'d, array (the size is a power of two; realloc() doubles the storage) that many child threads will absolutely be hammering, in a way that is both mission-critical and highly performance-critical.

From time to time a thread will need to add a new entry to the array. It will do so by setting the next array entry to point to something, then incrementing an atomic<int> iCount. Finally it will add data to some data structures that will cause other threads to attempt to dereference that cell.
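Here's roughly what I mean, as a sketch (the names AddEntry/GetEntry and the single-writer assumption are mine, not the real code's): the release store of the count keeps the cell write from being reordered after it, and the acquire load on the reader side makes the cell's contents visible.

#include <atomic>

void** apEntry;                // the shared array (hypothetical name)
std::atomic<int> iCount( 0 );  // count of valid entries

// Writer (assumed to be a single thread): fill the cell, then publish it.
void AddEntry( void* pNew ) {
    int i = iCount.load( std::memory_order_relaxed );
    apEntry[i] = pNew;                                 // write the cell first
    iCount.store( i + 1, std::memory_order_release );  // then bump the count
}

// Reader: the acquire load pairs with the release store above.
void* GetEntry( int i ) {
    if ( i < iCount.load( std::memory_order_acquire ) )
        return apEntry[i];
    return nullptr;
}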

It all seems fine (except that I'm not positive the increment of the count is assured of happening before the non-atomic updates that follow it)... except for one thing: realloc() will typically change the address of the array, and furthermore it frees the old one, a pointer to which is still visible to other threads.

OK, so instead of realloc(), I malloc() a new array, manually copy the contents, and set the pointer to the new array. I would free the old array, but I realize other threads may still be accessing it: they read the array base; I free the base; a third thread allocates it and writes something else there; the first thread then adds the indexed offset to the base and expects a valid pointer. I'm happy to leak those, though. (Given the doubling growth, all the old arrays combined are about the same size as the current array, so the overhead is simply an extra 16 bytes per item, and it's memory that soon is never referenced again.)
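A sketch of that grow-and-leak scheme (again, hypothetical names; assumes only one thread ever grows the array):

#include <atomic>
#include <cstdlib>
#include <cstring>

std::atomic<void**> apBase;  // shared array base, read by all threads
size_t nCap;                 // capacity, touched only by the growing thread

void Grow( void ) {
    void** apOld = apBase.load( std::memory_order_relaxed );
    void** apNew = (void**) malloc( 2 * nCap * sizeof( void* ) );
    memcpy( apNew, apOld, nCap * sizeof( void* ) );
    nCap *= 2;
    apBase.store( apNew, std::memory_order_release );  // publish the new array
    // apOld is deliberately leaked: readers may still hold its address.
}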

So, here's the crux of the question: once I allocate the bigger array, can I write its base address with a non-atomic write, in utter safety? Or, despite my billion-access test, do I actually have to make it atomic<>, and thus slow all the worker threads' reads of that atomic?

(As this is surely environment dependent, we're talking 2012-or-later Intel, g++ 4 to 9, and Red Hat of 2012 or later.)

EDIT: here is a modified test program that matches my planned scenario much more closely, with only a small number of writes. I've also added a count of the reads. When switching from void* to atomic<void*>, I go from 2240 reads/usec to 660 reads/usec (with optimization disabled). The machine language for the read is shown after the source.

#include <atomic>
#include <chrono>
#include <cstdint>   // intptr_t, uint64_t
#include <cstdio>    // printf
#include <thread>
#include <vector>

using namespace std;

chrono::time_point<chrono::high_resolution_clock> tp1, tp2;

// void*: 1169.093u 0.027s 2:26.75 796.6% 0+0k 0+0io 0pf+0w
// atomic<void*>: 6656.864u 0.348s 13:56.18 796.1%        0+0k 0+0io 0pf+0w

// Different definitions of the target variable.
atomic<void*> pvTearTest;
//void* pvTearTest;

// Children sum the tears they find, and at end, total checks performed.
atomic<int> iTears( 0 );
atomic<uint64_t> iReads( 0 );

bool bEnd = false; // main thr sets true; children all finish. (Note: this plain bool is itself a data race, tolerated only for this demo.)

void TearTest( void ) {

  uint64_t i;
  for ( i = 0; ! bEnd; i++ ) {

      intptr_t iTearTest = (intptr_t) (void*) pvTearTest;

      // Make sure top 4 and bottom 4 bytes are the same.  If not it's a tear.
      if ( ( iTearTest >> 32 ) != ( iTearTest & 0xFFFFFFFF ) ) {
          printf( "tear: pv = %ux\n", iTearTest );
          iTears++;
      }

      // Output periodically to prove we're seeing changing values.
      if ( ( (i+1) % 50000000 ) == 0 )
          printf( "got: pv = %lx\n", iTearTest );
  }

  iReads += i;
}



int main( int argc, char** argv ) {

  printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );

  vector<thread> athr;

  // Create lots of threads and have them do the test simultaneously.

  for ( int i = 0; i < 100; i++ )
      athr.emplace_back( TearTest );

  tp1 = chrono::high_resolution_clock::now();

#if 0
  // Change target as fast as possible for fixed number of updates.
  for ( int i = 0; i < 1000000000; i++ )
      pvTearTest = (void*) (intptr_t)
                   ( ( i % (1L<<32) ) * 0x100000001 );
#else
  // More like our actual app: change target only periodically, for fixed time.
  for ( int i = 0; i < 100; i++ ) {
      pvTearTest.store( (void*) (intptr_t) ( ( i % (1L<<32) ) * 0x100000001 ),
                        std::memory_order_release );

      this_thread::sleep_for(10ms);
  }
#endif

  bEnd = true;

  for ( auto& thr: athr )
      thr.join();

  tp2 = chrono::high_resolution_clock::now();

  chrono::duration<double> dur = tp2 - tp1;
  printf( "%ld reads in %.4f secs: %.2f reads/usec\n",
          iReads.load(), dur.count(), iReads.load() / dur.count() / 1000000 );

  if ( iTears )
      printf( "%d tears\n", iTears.load() );
  else
      printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}

Dump of assembler code for function TearTest():
   0x0000000000401256 <+0>:     push   %rbp
   0x0000000000401257 <+1>:     mov    %rsp,%rbp
   0x000000000040125a <+4>:     sub    $0x10,%rsp
   0x000000000040125e <+8>:     movq   $0x0,-0x8(%rbp)
   0x0000000000401266 <+16>:    movzbl 0x6e83(%rip),%eax        # 0x4080f0 <bEnd>
   0x000000000040126d <+23>:    test   %al,%al
   0x000000000040126f <+25>:    jne    0x40130c <TearTest()+182>
=> 0x0000000000401275 <+31>:    mov    $0x4080d8,%edi
   0x000000000040127a <+36>:    callq  0x40193a <std::atomic<void*>::operator void*() const>
   0x000000000040127f <+41>:    mov    %rax,-0x10(%rbp)
   0x0000000000401283 <+45>:    mov    -0x10(%rbp),%rax
   0x0000000000401287 <+49>:    sar    $0x20,%rax
   0x000000000040128b <+53>:    mov    -0x10(%rbp),%rdx
   0x000000000040128f <+57>:    mov    %edx,%edx
   0x0000000000401291 <+59>:    cmp    %rdx,%rax
   0x0000000000401294 <+62>:    je     0x4012bb <TearTest()+101>
   0x0000000000401296 <+64>:    mov    -0x10(%rbp),%rax
   0x000000000040129a <+68>:    mov    %rax,%rsi
   0x000000000040129d <+71>:    mov    $0x40401a,%edi
   0x00000000004012a2 <+76>:    mov    $0x0,%eax
   0x00000000004012a7 <+81>:    callq  0x401040 <printf@plt>
   0x00000000004012ac <+86>:    mov    $0x0,%esi
   0x00000000004012b1 <+91>:    mov    $0x4080e0,%edi
   0x00000000004012b6 <+96>:    callq  0x401954 <std::__atomic_base<int>::operator++(int)>
   0x00000000004012bb <+101>:   mov    -0x8(%rbp),%rax
   0x00000000004012bf <+105>:   lea    0x1(%rax),%rcx
   0x00000000004012c3 <+109>:   movabs $0xabcc77118461cefd,%rdx
   0x00000000004012cd <+119>:   mov    %rcx,%rax
   0x00000000004012d0 <+122>:   mul    %rdx
   0x00000000004012d3 <+125>:   mov    %rdx,%rax
   0x00000000004012d6 <+128>:   shr    $0x19,%rax
   0x00000000004012da <+132>:   imul   $0x2faf080,%rax,%rax
   0x00000000004012e1 <+139>:   sub    %rax,%rcx
   0x00000000004012e4 <+142>:   mov    %rcx,%rax
   0x00000000004012e7 <+145>:   test   %rax,%rax
   0x00000000004012ea <+148>:   jne    0x401302 <TearTest()+172>
   0x00000000004012ec <+150>:   mov    -0x10(%rbp),%rax
   0x00000000004012f0 <+154>:   mov    %rax,%rsi
   0x00000000004012f3 <+157>:   mov    $0x40402a,%edi
   0x00000000004012f8 <+162>:   mov    $0x0,%eax
   0x00000000004012fd <+167>:   callq  0x401040 <printf@plt>
   0x0000000000401302 <+172>:   addq   $0x1,-0x8(%rbp)
   0x0000000000401307 <+177>:   jmpq   0x401266 <TearTest()+16>
   0x000000000040130c <+182>:   mov    -0x8(%rbp),%rax
   0x0000000000401310 <+186>:   mov    %rax,%rsi
   0x0000000000401313 <+189>:   mov    $0x4080e8,%edi
   0x0000000000401318 <+194>:   callq  0x401984 <std::__atomic_base<unsigned long>::operator+=(unsigned long)>
   0x000000000040131d <+199>:   nop
   0x000000000040131e <+200>:   leaveq
   0x000000000040131f <+201>:   retq
Swiss Frank

  • Lots of UB things *happen to work* when you disable optimization!! That proves absolutely nothing about safety. But yes, `std::atomic` load/store with memory_order_relaxed compiles about the same as a `volatile int64_t`. (Or with optimization disabled, everything is sort of `volatile` for consistent debugging). That doesn't mean you should use it! [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) and [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881) – Peter Cordes Apr 21 '20 at 17:11
  • I don't understand your edit or the changelog. `volatile` wasn't "confusing" me, but it was the only thing that made this work like mo_relaxed even if you enable optimization. Understanding exactly what volatile does is key to understanding why this happened to work with optimization disabled, instead of hoisting the load and sinking the store out of their respective loops. Now the first sentence of my last comment truly applies: lots of UB happens to work in debug mode / optimization disabled. – Peter Cordes Apr 21 '20 at 17:46
  • [MCU programming - C++ O2 optimization breaks while loop](https://electronics.stackexchange.com/a/387478) and [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052) – Peter Cordes Apr 21 '20 at 17:47
  • I had volatile only to keep the assignment from being optimized out of the test loop that was updating it. Since everyone misunderstood that I was trying to use volatile as a feature of the proposed use, I've removed it. In fact the example is fine without it. – Swiss Frank Apr 21 '20 at 17:47
  • "fine without it" - for an extremely limited definition of fine, which doesn't include compiling with `-O2` or `-O3` like you would for a normal program. Anyway, I think this is basically a duplicate of the x86 natural alignment Q&A. If anyone tags this x86-64 or c++, I can dup-hammer it. – Peter Cordes Apr 21 '20 at 17:49
  • I appreciate the help Peter and didn't mean to imply either that you personally were confused or if so that it was your fault not mine. I don't know how to be clearer that I didn't mean to make volatile have any role in the question, than to completely delete all traces of it. – Swiss Frank Apr 21 '20 at 17:49
  • Ok, I get what you're saying now. I'm still surprised that you felt the need to do that because my first comment already answers this version of the question; now it's a duplicate of [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881). – Peter Cordes Apr 21 '20 at 17:53
  • This is the minimum compilable example program meant to demonstrate a behavior. By "fine" I only meant, that for the billion updates of the non-volatile, non-atomic pointer, all 100 threads observing it never saw it tear. Are you saying it would tear if compiled by -O2? Do you have a recommendation on how to change the example so that I CAN compile with -O2 but not have the assignment hoisted out of the loop? – Swiss Frank Apr 21 '20 at 17:53
  • No, if you ran it with optimization enabled, you wouldn't see tearing because the testing would optimize away. Sorry I didn't think through what the symptoms of the UB would be in this case. You can't get x86-64 to tear an aligned qword load or store, but that doesn't make plain `int64_t` safe in C++. Like I explained in [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881). Especially if you're loading or storing a pointer, you probably want at least release / acquire, not the `mo_relaxed` you get from hacks. – Peter Cordes Apr 21 '20 at 17:55
  • *and thus slow all worker threads to read that atomic?* - on x86, an `atomic` load has no extra cost even with seq_cst. Load into a local non-atomic tmp var if you need to use it multiple times, so the compiler can keep it in a register instead of re-reading from memory (like `volatile` would). – Peter Cordes Apr 21 '20 at 20:14
  • @Peter Cordes: on x86, an atomic load has no extra cost even with seq_cst?? Converting from void* to atomic seems to slow reads/sec by 75%, from 2240/us to 660/us. I put the modified test program above, with disassembly of the reading function. The program is modified 1) to count reads and sum when test complete; 2) instead of hammering a billion writes as fast as possible, it now does 100 writes separated by 10ms pauses (the real world scenario is more like a write an hour). – Swiss Frank Apr 22 '20 at 16:05
  • @SwissFrank you are measuring a debug build - try compiling with -O3. – mpoeter Apr 22 '20 at 16:21
  • EXCELLENT, @mpoeter, yes, I forgot the goal was to optimize THEN test. For -O2 and -O3 I see 4151 to 4155 reads/usec for both void* and atomic. For -O I see 2196 and 2210. As you and Peter suggested, atomic<> has no read overhead at least with -O. Still... can you summarize what the advantage of atomic<> is, if it results in the same code for read? Is the write code different, or what? – Swiss Frank Apr 22 '20 at 16:56
  • @SwissFrank: Correct, extra cost on x86 only for seq_cst stores, not loads. Like I explained in my answer to [Can atomic operations on a non-atomic<> pointer be safe and faster than atomic<>?](https://stackoverflow.com/q/61324676). (I started writing that because it popped up after someone else edited, but I thought you were asking a new question after I'd already explained in comments that you should just use `atomic`. I decided to finish answering once I realized it actually predated our comments here in case it was useful to anyone else.) – Peter Cordes Apr 22 '20 at 19:45
  • @SwissFrank: *what the advantage of atomic<> is* - the most obvious is avoiding C++ UB, ordering guarantees portably across ISAs, and not needing any nasty hacks like `volatile` to roll your own atomics. `volatile` is like an unsafe version of `std::memory_order_relaxed`, with no portable atomicity *guarantees*, no atomic increments or other RMWs. If you didn't use *either* of those things (atomic or volatile) and just relied on non-inline function boundaries as compile-time memory barriers, see [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) – Peter Cordes Apr 22 '20 at 19:49
  • @SwissFrank: Basically: do you think you understand memory ordering models, optimizing compilers, etc., in enough detail to safely and correctly implement your own atomics with `volatile`? Even if so, I wouldn't bother when we have `atomic` already offering standards-guaranteed correctness supported by compiler builtins like GNU C [`__atomic_load_n`](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). For example if you ever compiled for a 32-bit ISA, 64-bit atomic objects would need a bit of special asm (like using `movq xmm0, [mem]` on 32-bit x86) – Peter Cordes Apr 22 '20 at 19:54
  • @SwissFrank: Like if you had a variable that really had to be exactly 32 bits for some bit-manipulation to work right, it's better to declare it `uint32_t` even if you're compiling for x86 where `unsigned` happens to be the same type. Although with atomics the case is even stronger because the ISO C++ standard literally leaves the behaviour undefined for a data race on a `volatile`. (In practice compilers do define it as whatever the HW memory model says that means). **For your case, one concrete gain is going to be release-store ordering at compile time when you update the pointer.** – Peter Cordes Apr 22 '20 at 19:58
  • OK, I have it written... but may have found a race condition. This T is char** apsz[]. If a reader calls apsz[x], are they guaranteed to load x before apsz? Or is there a simple way to rewrite "context.apsz[x]" such that x is loaded before apsz? The worry is that thread T1 reads context.apsz then loses CPU. Thread T2 wants to add a new string to apsz and it's full, so it allocates a new apsz of twice the cells; copies from old context.apsz, adds new string to end of the new apsz, atomically writes the new apsz to context (leaking the old apsz but acceptable), returns. (to be con't) – Swiss Frank Apr 24 '20 at 15:34
  • (con't) The application (still T2) then stores the subscript of this new string in globally accessible variable x. T1 wakes up, reads the array subscript x, gets the offset that is present in the new apsz but past the end of the old one; walks off the end of the old apsz it's just loaded, and coredump or worse. If I can force the read of x before the read of apsz, and depend on apsz being updated before a new array subscript were ever written, I'd be safe. – Swiss Frank Apr 24 '20 at 15:36

1 Answer


Yes, on x86 aligned loads are atomic, BUT this is an architectural detail that you should NOT rely on!

Since you are writing C++ code, you have to abide by the rules of the C++ standard, i.e., you have to use atomics instead of volatile. The fact that volatile was part of the language long before the introduction of threads in C++11 should be a strong enough indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C# (in those languages volatile is in fact related to the memory model and therefore much more like an atomic in C++).

In C++, volatile is used for what is often referred to as "unusual memory". This is typically memory that can be read or modified outside the current process, for example when using memory-mapped I/O. volatile forces the compiler to perform all accesses in exactly the order specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:

volatile int x;
         int y;
volatile int z;

x = 1;
y = 2;
z = 3;
z = 4;

...

int a = x;
int b = x;
int c = y;
int d = z;

In this example, there are two assignments to z and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to treat the first store as irrelevant and simply remove it. Likewise, it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in exactly the order specified, i.e., the volatile operations cannot be reordered with respect to each other.

However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down - something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.
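To illustrate, here is one transformation the compiler could legally apply to the volatile version above (a sketch of permitted reordering, not actual compiler output):

x = 1;      // volatile: kept, in order
z = 3;      // volatile: both stores kept, in order
z = 4;
y = 2;      // non-volatile: legally sunk below the volatile stores

...

int a = x;  // volatile: both loads of x are performed
int b = x;
int d = z;
int c = y;  // non-volatile: legally reordered among the volatile reads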

Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
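A sketch of that rare combination (the device address and names here are made up):

#include <atomic>
#include <cstdint>

// A status word that hardware updates *and* that several threads poll:
// volatile because it is "unusual memory", atomic to avoid the data race.
volatile std::atomic<uint32_t>* pStatus =
    reinterpret_cast<volatile std::atomic<uint32_t>*>( 0xFFFF0000 );

bool Ready() {
    return ( pStatus->load( std::memory_order_acquire ) & 1u ) != 0;
}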

Since aligned loads are actually atomic on x86, the compiler will translate an atomic.load() call to a simple mov instruction, so an atomic load is not slower than reading a volatile variable. An atomic.store() is actually slower than writing a volatile variable, but for good reasons, since in contrast to the volatile write it is by default sequentially consistent. You can relax the memory orders, but you really have to know what you are doing!!
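For example (a sketch; on x86, g++ emits an xchg for the seq_cst store but a plain mov for the release store):

#include <atomic>

std::atomic<void*> pv;

void StoreSeqCst( void* p ) { pv.store( p ); }                               // default seq_cst: xchg
void StoreRelease( void* p ) { pv.store( p, std::memory_order_release ); }  // plain mov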

If you want to learn more about the C++ memory model, I can recommend this paper: Memory Models for C/C++ Programmers

mpoeter

  • Thx both for help today and a couple days ago. OK, I'll do timer tests but: atomic sounds like it's no slower than X* in this case for read. And I'll be having 10^6 reads/sec and only 20 or so max writes/day. I'm only a C/C++ guy so won't be misled by assuming parallels with Java/C#. Question, though: you say "this is an architectural detail that you should NOT rely on." I take your objections to atomic vs. volatile T*, but what about atomic vs. plain T*? What are the reasons I shouldn't rely on that other than sequential consistency? – Swiss Frank Apr 21 '20 at 09:56
  • The answer is simple: undefined behavior! The standard states that "The execution of a program contains a _data race_ if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other [...] Any such data race results in _undefined behavior_." So if you use a plain T* instead of an atomic, and you have read+write operations that are not ordered some other way (i.e., there is no happens-before relation between these operations), you find yourself in the land of undefined behavior. – mpoeter Apr 21 '20 at 12:29
  • @SwissFrank and there is a good reason that atomic operations are sequentially consistent by default, because the more relaxed memory models are quite difficult to grasp, and you will likely run into some unexpected situations if you don't know exactly what you are doing. – mpoeter Apr 21 '20 at 12:31
  • @SwissFrank: If you want code that compiles like it would with `volatile`, but without UB, use `atomic` with `memory_order_relaxed` - [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - basically never. On x86 the hardware memory ordering model is (stronger than) C++ acq_rel. `volatile` doesn't guarantee much about compile-time reordering but at runtime it will be like C++ mo_acq_rel on x86, or like mo_relaxed on most other ISAs. Therefore it's a terrible choice and will give you "happens to work" code that's not future-proof or portable for no benefit. – Peter Cordes Apr 21 '20 at 17:17
  • Just to be clear, I marked the var volatile so that the loop modifying it wouldn't optimize that out, not because I thought volatile was important to the multithread aspect of the problem. – Swiss Frank Apr 21 '20 at 17:20
  • OK, since it was confusing everyone into thinking it was my main point, I removed the volatile. The results are identical. – Swiss Frank Apr 21 '20 at 17:31
  • By switching the void* to atomic I am seeing the CPU and elapsed time go up by a factor of 5-6x. I'm still writing a billion times though I would have guessed with 100 reading threads, the time spent reading would have dominated. I'll reduce the writes from a billion to a dozen (more in line with what would actually happen) and see if atomic<> still makes a huge difference. – Swiss Frank Apr 21 '20 at 18:01
  • @SwissFrank try using `std::memory_order_release` for the write operations, that should improve performance. – mpoeter Apr 21 '20 at 20:53
  • @mpoeter I've modified the test program to do 100 writes from the main thread, each with sleep_for(10ms). Once that is done, all child threads, which are spinning reading and checking for tears, report how many iterations they've done. WHEN I SWITCH TO ATOMIC, the reads are about 1/4 the speed of when the variable is simply void*. `memory_order_release` for stores has no effect. The machine language shows a callq to std::atomic<void*>::operator void*() which may account for it? How to avoid that? (I'm adding the amended test program above) – Swiss Frank Apr 22 '20 at 15:52
  • @SwissFrank: For the record, you need to enable optimization to get efficient asm from C++ template libraries. (Or in general). This was already discussed under the question, but this comment thread was left with the last comment getting 1/4 performance. Because of a debug build adding overhead and probably making the stores compile like `seq_cst`. – Peter Cordes Feb 28 '23 at 06:36