I have code that is using thread_local buffers and is similar to this:
#include <vector>

int processing(const std::vector<int>& buffer);  // defined elsewhere

int func() {
    thread_local std::vector<int> buffer;
    buffer.resize(0);
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }
    return processing(buffer);
}
While profiling my code, I noticed that GCC places a call to __tls_get_addr() inside the loop body in order to access buffer. The disassembly of the loop body from Godbolt looks like this:
lea rbx, -20[rbp]
data16 lea rdi, func()::buffer@tlsgd[rip]
.value 0x6666
rex64
call __tls_get_addr@PLT ; <- This call!
mov rsi, rbx
mov rdi, rax
call std::vector<int, std::allocator<int> >::push_back(int const&)@PLT
These calls slow the loop down significantly. I can work around it manually by binding a reference once:
int func() {
    static thread_local std::vector<int> _buffer;
    auto& buffer = _buffer;  // bind once; the TLS address is computed here
    buffer.resize(0);
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }
    return processing(buffer);
}
This gets rid of the calls to __tls_get_addr() and solves the slowness. It seems really silly that I have to do this manually for every such variable. Why won't GCC cache the result of __tls_get_addr() automatically? Clang seems able to do it at -O3, which suggests this is a legal optimization that GCC simply hasn't implemented. Is that so?
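The most I can do to cut the boilerplate is wrap the pattern in a helper, but it still has to be called explicitly at every use. A minimal sketch (tls_scratch and Tag are names I made up here):

#include <vector>

// One per-thread instance per (T, Tag) pair; the TLS address lookup
// happens once, when the reference is bound, not on every access.
template <typename T, typename Tag = void>
T& tls_scratch() {
    thread_local T instance;
    return instance;
}

int func() {
    auto& buffer = tls_scratch<std::vector<int>>();  // single TLS lookup
    buffer.resize(0);
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }
    return processing(buffer);
}

Note that two call sites with the same T and Tag share the same per-thread instance; a distinct Tag type keeps them separate.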
Changing the TLS model to initial-exec also eliminates these calls, but my library is usually dlopen()ed as a Python extension, so my understanding is that this is not an option in that case.
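For reference, the form I mean is the tls_model attribute (documented for GCC and Clang), or -ftls-model=initial-exec for a whole translation unit:

// initial-exec skips __tls_get_addr entirely, but a dlopen()ed module
// can only use it while the variable still fits in the loader's static
// TLS reserve; otherwise dlopen() fails at load time.
__attribute__((tls_model("initial-exec")))
static thread_local std::vector<int> buffer;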