Why is there a locked xadd instruction in this disassembled std::string dtor?

Question

I have some very simple code:

#include <string>
#include <iostream>

int main() {
    std::string s("abc");
    std::cout << s;
}

Then, I compiled it:

g++ -Wall test_string.cpp -o test_string -std=c++17 -O3 -g3 -ggdb3

Then I disassembled it, and the most interesting piece is:

00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
  4009a0:       48 81 ff a0 11 60 00    cmp    rdi,0x6011a0
  4009a7:       75 01                   jne    4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
  4009a9:       c3                      ret    
  4009aa:       b8 00 00 00 00          mov    eax,0x0
  4009af:       48 85 c0                test   rax,rax
  4009b2:       74 11                   je     4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
  4009b4:       83 c8 ff                or     eax,0xffffffff
  4009b7:       f0 0f c1 47 10          lock xadd DWORD PTR [rdi+0x10],eax
  4009bc:       85 c0                   test   eax,eax
  4009be:       7f e9                   jg     4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
  4009c0:       e9 cb fd ff ff          jmp    400790 <_ZdlPv@plt>
  4009c5:       8b 47 10                mov    eax,DWORD PTR [rdi+0x10]
  4009c8:       8d 50 ff                lea    edx,[rax-0x1]
  4009cb:       89 57 10                mov    DWORD PTR [rdi+0x10],edx
  4009ce:       eb ec                   jmp    4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>

Why does _ZNSs4_Rep10_M_disposeERKSaIcE.isra.10 (which is std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&) [clone .isra.10]) contain a lock-prefixed xadd?

A follow-up question is how I can avoid it?

Memory alloc/dealloc from a global allocator needs atomic operations (or locking) to be thread-safe. But I would have expected thread operations inside libc or libstdc++, not inlined. I'm also not sure this code ever runs; it looks like GCC jumps over it by testing that `0 == 0` first. Which is super-weird and so maybe the stuff after `test rax,rax` / `je` is unreachable and just a missed-optimization. — Peter Cordes, Jul 25 '19 at 15:31
What GCC version on what OS (what header version)? I can't repro this on the Godbolt compiler explorer (https://godbolt.org/z/AvOZj1) with Linux GCC 5/6/7/8/9 or clang 8, even with clang `-stdlib=libc++` (instead of the default libstdc++). This looks like probably real GCC output (no Apple Clang installed as `gcc`), based on the `.isra.10` clone name, but I could be wrong. — Peter Cordes, Jul 25 '19 at 15:37
@PeterCordes I was using GCC 8.2.1 on CentOS 7. Kernel version 3.10 — HCSF, Jul 25 '19 at 16:45
Please post assembly output, not "disassembly" (binary code expressed in asm). — curiousguy, Aug 12 '19 at 14:09

BeeOnRope · Accepted Answer · 2019-07-25T19:07:59.877

It looks like code associated with copy on write strings. The locked instruction is decrementing a reference count and then calling operator delete only if the reference count for the possibly shared buffer containing the actual string data is zero (i.e., it is not shared: no other string object refers to it).
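To make the reference-counting idea concrete, here is a minimal sketch of a shared buffer with a dispose step. The `Rep` struct and the `rep_create`, `rep_share`, and `rep_dispose` names are hypothetical and much simpler than what libstdc++ actually does; the only convention borrowed from it is that a count of zero means "one owner, not shared".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, simplified stand-in for the header of a COW string buffer.
struct Rep {
    int refcount;        // number of *additional* owners; 0 means unshared
    std::size_t length;  // length of the stored text, excluding the NUL
    char* data;          // heap-allocated, NUL-terminated characters
};

// Allocates a fresh, unshared buffer holding a copy of `text`.
Rep* rep_create(const char* text) {
    assert(text != nullptr);
    Rep* rep = new Rep;
    rep->refcount = 0;
    rep->length = std::strlen(text);
    rep->data = new char[rep->length + 1];
    std::memcpy(rep->data, text, rep->length + 1);
    return rep;
}

// "Copying" a COW string only bumps the count; no characters are copied.
Rep* rep_share(Rep* rep) {
    ++rep->refcount;
    return rep;
}

// Decrements the count and frees the buffer if nobody else was sharing it.
// Returns true if the buffer was freed. This version is NOT thread-safe.
bool rep_dispose(Rep* rep) {
    const int previous = rep->refcount--;  // non-atomic exchange-and-add of -1
    if (previous <= 0) {
        delete[] rep->data;
        delete rep;
        return true;
    }
    return false;
}
```

With two owners, the first `rep_dispose` call only decrements the count and the second one frees the buffer. A thread-safe variant would have to make the decrement atomic, which is the role `lock xadd` plays in the disassembly.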

Since libstdc++ is open source, we can confirm this by looking at the source!

The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE de-mangles¹ to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&). Here's the corresponding source for libstdc++ in the gcc-4.x era²:

    void
    _M_dispose(const _Alloc& __a)
    {
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
        {
          // Be race-detector-friendly.  For more info see bits/c++config.
          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
          if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
                                                     -1) <= 0)
            {
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
              _M_destroy(__a);
            }
        }
    }  // XXX MT

Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:

00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:

  # the next two lines implement the check:
  # if (__builtin_expect(this != &_S_empty_rep(), false))
  # which is an empty string optimization. The S_empty_rep singleton
  # is at address 0x6011a0 and if the current buffer points to that
  # we are done (execute the ret)
  4009a0: cmp    rdi,0x6011a0
  4009a7: jne    4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
  4009a9: ret

  # now we are in the implementation of
  # __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
  # which dispatches either to an atomic version of the add function
  # or the non-atomic version, depending on the value of `eax` which
  # is always directly set to zero, so the non-atomic version is 
  # *always called* (see details below)
  4009aa: mov    eax,0x0
  4009af: test   rax,rax
  4009b2: je     4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>

  # this is the atomic version of the decrement you were concerned about
  # but we never execute this code because the test above always jumps
  # to 4009c5 (the non-atomic version)
  4009b4: or     eax,0xffffffff
  4009b7: lock xadd DWORD PTR [rdi+0x10],eax
  4009bc: test   eax,eax
  # xadd leaves the previous refcount in eax: if it was greater than
  # zero the buffer is still shared, so skip the delete
  4009be: jg     4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
  # the delete call
  4009c0: jmp    400790 <_ZdlPv@plt> # tailcall

  # the non-atomic version starts here, this is the code that is 
  # always executed
  4009c5: mov    eax,DWORD PTR [rdi+0x10]
  4009c8: lea    edx,[rax-0x1]
  4009cb: mov    DWORD PTR [rdi+0x10],edx
  # this jumps up to the test eax,eax check which calls operator delete
  # if the refcount was zero
  4009ce: jmp    4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>

A key note is that the lock xadd code you were concerned about is never executed. There is a mov eax, 0 followed by a test rax, rax; je - this test always succeeds and the jump always occurs because rax is always zero.

What's happening here is that __gnu_cxx::__exchange_and_add_dispatch is implemented in a way that checks whether the process is definitely single threaded. If it is definitely single threaded, it doesn't bother to use expensive atomic instructions for things like __exchange_and_add_dispatch: it simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create. If this is zero, the pthread library hasn't been linked in, and hence the process is definitely single threaded. In your case, the address of this pthread function gets resolved at link time to 0 (you didn't have -lpthread on your compile command line), which is where the mov eax, 0x0 comes from. At link time, it's too late to optimize on this knowledge, so the vestigial atomic decrement code remains but never executes. This mechanism is described in more detail in this answer.
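The dispatch described above can be sketched roughly as follows. This is not the libstdc++ implementation: the function names are made up, and the `threads_active` parameter is a hypothetical stand-in for the real check of whether the pthread symbol resolved to a non-null address. `__atomic_fetch_add` is a GCC/Clang builtin, so the sketch assumes one of those compilers.

```cpp
#include <cassert>

// Non-atomic path: plain load, add, store. This corresponds to the
// mov / lea / mov sequence at the end of the disassembled function.
inline int exchange_and_add_single(int* mem, int val) {
    const int previous = *mem;
    *mem += val;
    return previous;
}

// Atomic path: GCC/Clang builtin that returns the previous value.
// On x86 this typically compiles to a lock xadd instruction.
inline int exchange_and_add_atomic(int* mem, int val) {
    return __atomic_fetch_add(mem, val, __ATOMIC_ACQ_REL);
}

// Picks a path at run time. `threads_active` is an assumption of this
// sketch; the real code derives it from the address of a pthread symbol.
inline int exchange_and_add_dispatch(int* mem, int val, bool threads_active) {
    assert(mem != nullptr);
    return threads_active ? exchange_and_add_atomic(mem, val)
                          : exchange_and_add_single(mem, val);
}
```

Both paths return the value the counter held before the addition, which is why the disassembly can share the `test eax,eax` / `jg` tail between them. If the condition only becomes a constant after the compiler has finished, both branches have to be emitted, which matches the dead `lock xadd` path seen in the question.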

The code that does execute is the last part of the function, starting at 4009c5. This code also decrements the reference count, but in a non-atomic way. The check which decides between these two options depends on whether the process may be multithreaded, i.e., whether -lpthread has been linked. Because this check, inside __exchange_and_add_dispatch, is only resolved when linking, the compiler cannot remove the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build process (after all, the hard-coded mov eax, 0 got there somehow).

A follow-up question is how I can avoid it?

Well, you've already avoided the lock xadd part, so if that's what you care about, you're good to go. However, you still have a cause for concern:

Copy on write std::string implementations are not standards compliant due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17.

The problem is most likely distribution related: CentOS 7 by default uses an ancient gcc version (< 5) which still uses the non-compliant COW strings. However, you mention that you are using gcc 8.2.1, which in a normal install uses non-COW strings by default. It seems that if you installed 8.2.1 using the RHEL "devtools" method, you get a new gcc which still uses the old ABI and links against the old system libstdc++.

To confirm this, you might want to check the value of the _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).
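One possible way to run that check is sketched below. The helper names are hypothetical. `_GLIBCXX_USE_CXX11_ABI` and `__GLIBCXX__` are real libstdc++ macros, but the sketch guards both so it still compiles with other standard libraries. The buffer-sharing probe is an extra, empirical test that I would expect to report sharing only with COW strings.

```cpp
#include <cassert>
#include <string>  // any libstdc++ header defines the configuration macros

// 1: new (C++11) string ABI, 0: old COW ABI,
// -1: macro not defined (not libstdc++, or headers older than GCC 5).
int cxx11_abi_flag() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
    return _GLIBCXX_USE_CXX11_ABI ? 1 : 0;
#else
    return -1;
#endif
}

// libstdc++ date stamp from the headers in use, or 0 if not libstdc++.
long libstdcxx_date_stamp() {
#if defined(__GLIBCXX__)
    return __GLIBCXX__;
#else
    return 0;
#endif
}

// Returns true if copying a string reuses the source's character buffer.
// Both objects are const so that only const accessors are called; with a
// COW string, non-const access could force the buffers to be unshared.
bool copy_shares_buffer() {
    const std::string original(64, 'x');  // too long for a small-string buffer
    const std::string copy(original);
    assert(original == copy);
    return original.data() == copy.data();
}
```

Printing the three results from `main` in the test program should show which ABI the build actually uses. Note that the macros describe the headers the compiler picked up, which is not necessarily the same libstdc++ version that is loaded at run time.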

You can avoid it by using an OS other than CentOS, one that doesn't use ancient gcc and glibc versions. If you need to stick with CentOS for some reason, you'll have to look into whether there is a supported way to use a newer libstdc++ version on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions of your local host.

¹ You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt.

² I'm using gcc-4 era source since I'm guessing that's what you end up using in CentOS 7.
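The demangling from footnote ¹ can also be done from inside a program. The sketch below uses `abi::__cxa_demangle` from `<cxxabi.h>`, which is available with GCC and Clang; the `demangle` wrapper itself is a hypothetical helper. The exact spelling of the result can differ between toolchains (for example `std::string` versus the fully expanded `std::basic_string<...>`).

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of `mangled`, or an empty string on failure.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // Passing null for the buffer and length makes __cxa_demangle allocate
    // the result with malloc; the caller must release it with free.
    char* buffer = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || buffer == nullptr) {
        return std::string();
    }
    std::string result(buffer);
    std::free(buffer);
    return result;
}
```

Calling `demangle("_ZNSs4_Rep10_M_disposeERKSaIcE")` should yield a name ending in `_Rep::_M_dispose(std::allocator<char> const&)`.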

I wonder if that `mov eax,0` is from inline asm. Normally gcc isn't that dumb. — Peter Cordes, Jul 25 '19 at 17:26
@PeterCordes - yeah it's weird. It might actually be a "link time optimization" or something like that. Not a *normal* link time optimization, but something like how the linker goes back and adjusts the TLS model depending on details only known at link time, by inserting specific assembly sequences into binaries which might be simpler than the conservative one the compiler originally generated ([details](https://www.uclibc.org/docs/tls.pdf)). — BeeOnRope, Jul 25 '19 at 17:29
@PeterCordes - [here's the answer](https://godbolt.org/z/G_Hr9Y). It's that thing where the atomic operations are compiled in a way that does the non-atomic version if `pthreads` isn't linked in, but the atomic one if it is. It is resolved at link time, so by then it is too late to go back and optimize the untaken side of the branch. The godbolt shows the check is resolved at build time for executables, but in shared library (i.e., -fPIC) code the check is probably a runtime one. — BeeOnRope, Jul 25 '19 at 18:49
And in a PIE executable, we just get `cmp QWORD PTR [rip+xxxx],0x0` for both, getting the value from the GOT entry :( BTW, is it just my browser or does Godbolt binary mode copy the machine code hex from one pane to all the other panes, separated by `|` when multiple panes have binary mode enabled? — Peter Cordes, Jul 25 '19 at 19:03
@PeterCordes - yeah I noticed the weirdness with every other line being hex, but didn't understand what was happening. Maybe it's some kind of attempt at a diff? — BeeOnRope, Jul 25 '19 at 19:07
@BeeOnRope yes I used GCC 8.2.1 in devtools which comes with a new libstdc++. That's odd that a new GCC uses old ABI and link against old system libstdc++ given new ABI is introduced in GCC 5. I will check ABI and enable more log to see what gets linked. Thanks — HCSF, Jul 26 '19 at 05:20
@HCSF - based on my quick search, that's how RHEL (and hence CentOS) dev tools is supposed to work: it uses the old ABI even with new compiler versions. See for example [this answer](https://stackoverflow.com/a/52611576/149138) which indicates that you are stuck on the old ABI with devtoolset gcc 7. To be clear, it is not a gcc problem per se, it's how RHEL have decided to implement distribution of newer gcc versions. See also [this bug](https://bugzilla.redhat.com/show_bug.cgi?id=1546704). — BeeOnRope, Jul 26 '19 at 05:38
So it means devtoolset 7's and 8's GCC were built with a mix of static and dynamic libstdc++ and so the ABI macro flag is ignored; and so to solve this issue, I need to either move centos 7 or rebuild GCC so that it uses the libstdc++ shared library instead? — HCSF, Jul 27 '19 at 14:02
@HCSF - I don't know, and I don't think we'll solve it here in the comments. If you have a specific question you could ask it as a new query. I try to stay away from CentOS at least partly for that reason. — BeeOnRope, Jul 27 '19 at 14:28
