
I have code that uses thread_local buffers, similar to this:

#include <vector>

int processing(const std::vector<int>& buffer);  // defined elsewhere

int func() {
    thread_local std::vector<int> buffer;

    buffer.resize(0);  // keep the capacity, drop the old elements
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }

    return processing(buffer);
}

While profiling my code, I noticed that gcc places a call to __tls_get_addr() inside the loop's body in order to access buffer. The disassembly of the loop body (from godbolt) looks like this:

        lea     rbx, -20[rbp]
        data16  lea rdi, f()::buffer@tlsgd[rip]
        .value  0x6666
        rex64
        call    __tls_get_addr@PLT  ; <- This call!
        mov     rsi, rbx
        mov     rdi, rax
        call    std::vector<int, std::allocator<int> >::push_back(int const&)@PLT

These calls slow the loop down considerably. I can work around it manually with a reference:

int func() {
    static thread_local std::vector<int> _buffer;
    auto& buffer = _buffer;  // resolve the TLS address once, outside the loop

    buffer.resize(0);
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }

    return processing(buffer);
}

This eliminates the calls to __tls_get_addr() and fixes the slowdown. It seems silly that I have to do this manually for every such variable. Why won't gcc cache the result of __tls_get_addr() automatically? Clang manages to do it at -O3, which suggests it is a legal optimization that gcc simply hasn't implemented. Is that so?

Changing the TLS model to initial-exec also eliminates these calls, but my library is usually loaded dynamically as a Python extension, and my understanding is that the initial-exec model cannot be relied on for dlopen()ed libraries, since it requires the variable to live in the static TLS block allocated at program startup.
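For reference, here is a sketch of what I mean by changing the model, using the GNU tls_model attribute (the per-variable equivalent of compiling with -ftls-model=initial-exec):

// Sketch of the initial-exec variant. The access becomes a direct
// offset from the thread pointer (no __tls_get_addr call), but the
// variable's TLS block must be in static TLS, which a dlopen()ed
// library can't guarantee.
int func() {
    static thread_local std::vector<int> buffer
        __attribute__((tls_model("initial-exec")));

    buffer.resize(0);
    for (int i = 0; i < 10000; i++) {
        buffer.push_back(i);
    }

    return processing(buffer);
}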

unddoch

0 Answers