It looks like code associated with copy on write strings. The locked instruction is decrementing a reference count and then calling operator delete
only if the reference count for the possibly shared buffer containing the actual string data is zero (i.e., it is not shared: no other string object refers to it).
Since libstdc++ is open source, we can confirm this by looking at the source!
The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE
de-mangles1 to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&)
. Here's the corresponding source for libstdc++ in the gcc-4.x era2:
void
_M_dispose(const _Alloc& __a)
{
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
-1) <= 0)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
_M_destroy(__a);
}
}
} // XXX MT
Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
# the next two lines implement the check:
# if (__builtin_expect(this != &_S_empty_rep(), false))
# which is an empty string optimization. The S_empty_rep singleton
# is at address 0x6011a0 and if the current buffer points to that
# we are done (execute the ret)
4009a0: cmp rdi,0x6011a0
4009a7: jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: ret
# now we are in the implementation of
# __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
# which dispatches either to an atomic version of the add function
# or the non-atomic version, depending on the value of `eax` which
# is always directly set to zero, so the non-atomic version is
# *always called* (see details below)
4009aa: mov eax,0x0
4009af: test rax,rax
4009b2: je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
# this is the atomic version of the decrement you were concerned about
# but we never execute this code because the test above always jumps
# to 4009c5 (the non-atomic version)
4009b4: or eax,0xffffffff
4009b7: lock xadd DWORD PTR [rdi+0x10],eax
4009bc: test eax,eax
# check if the result of the xadd was zero, if not skip the delete
4009be: jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
# the delete call
4009c0: jmp 400790 <_ZdlPv@plt> # tailcall
# the non-atomic version starts here, this is the code that is
# always executed
4009c5: mov eax,DWORD PTR [rdi+0x10]
4009c8: lea edx,[rax-0x1]
4009cb: mov DWORD PTR [rdi+0x10],edx
# this jumps up to the test eax,eax check which calls operator delete
# if the refcount was zero
4009ce: jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
A key note is that the lock xadd
code you were concerned about is never executed. There is a mov eax, 0
followed by a test rax, rax; je
- this test always succeeds and the jump always occurs because rax
is always zero.
What's happening here is that __gnu_cxx::__atomic_add_dispatch
is implemented in a way that it checks whether the process is definitely single threaded. If it is definitely single threaded, it doesn't bother to use expensive atomic instructions for things like __atomic_add_dispatch
- it simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create
- if this is zero, the pthread
library hasn't been linked in, and hence the process is definitely single threaded. In your case, the address of this pthread function gets resolved at link time to 0
(you didn't have -lpthread
on your compile command line), which is where the mov eax, 0x0
comes from. At link time, it's too late to optimize on this knowledge, so the vestigial atomic increment code remains but never executes. This mechanism is described in more detail in this answer.
The code that does execute is the last part of the function, starting at 4009c5
. This code also decrements the reference count, but in a non-atomic way. The check which decides between these two options is probably based on whether the process is multithreaded or not, e.g., whether -lpthread
has been linked. For whatever reason this check, inside __exchange_and_add_dispatch
, is implemented in a way that prevents the compiler from actually removing the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build process (after all, the hard-coded mov eax, 0
got there somehow).
A follow-up question is how I can avoid it?
Well you've already avoided the lock add
part, so if that's what you care about, your good to go. However, you still have a cause for concern:
Copy on write std::string
implementations are not standards compliant due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17
.
The problem is most likely distribution related: CentOS 7 by default uses an ancient gcc version < 5 which still uses the non-compliant COW strings. However, you mention that you are using gcc 8.2.1, which by default in a normal install which uses non-COW strings. It seems like if you installed 8.2.1 use the RHEL "devtools" method, you'll get a new gcc which still uses the old ABI and links against the old system libstdc++.
To confirm this, you might want to check the value of _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).
You can avoid by using an OS other than CentOS that doesn't use ancient gcc and glibc version. If you need to stick with CentOS for some reason you'll have to look into if there is a supported way to use newer libstdc++ version on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions of your local host.
1 You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt
.
2 I'm using gcc-4 era source since I'm guessing that's what you end up using in CentOS 7.