What is the mechanism behind the GNU C++ 4.8 thread_local implementation, and what exactly is the "run-time penalty"?

Question

GCC 4.8.0 added an implementation of thread_local from the C++11 standard. The release notes ("Changes") state that there may be a "run-time penalty":

G++ now implements the C++11 thread_local keyword; [...] Unfortunately, this support requires a run-time penalty for references to non-function-local thread_local variables defined in a different translation unit even if they don't need dynamic initialization, [...].

If the programmer can be sure that no use of the variable in a non-defining TU needs to trigger dynamic initialization (either because the variable is statically initialized, or a use of the variable in the defining TU will be executed before any uses in another TU), they can avoid this overhead with the -fno-extern-tls-init option.
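To make the quoted wording concrete, here is my rough mental model of the check it describes. As far as I understand, G++ routes such accesses through a compiler-generated wrapper function that makes sure the variable has been initialized in the current thread before handing back a reference. The sketch below only approximates that idea in a single file with a function-local thread_local. It is not GCC's actual generated code, and Data, make_data, data_wrapper, init_calls and this version of worker are hypothetical names I made up for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Data type; the real one is not shown here.
struct Data {
    explicit Data(int n) : values(n, 0) {}
    std::vector<int> values;
};

// Counts how often the dynamic initializer actually runs (illustration only).
std::atomic<int> init_calls{0};

// Hypothetical dynamic initializer for the thread-local object.
Data make_data() {
    ++init_calls;
    return Data(1000);
}

// Hand-written analogue of an access wrapper: every access is a function
// call plus a per-thread "already initialized?" check. The initializer
// itself runs only on the first call in each thread.
Data& data_wrapper() {
    thread_local Data instance = make_data();
    return instance;
}

// Every use of the variable goes through the wrapper.
std::size_t worker() {
    std::size_t visited = 0;
    for (auto& elem : data_wrapper().values) {
        elem += 1;
        ++visited;
    }
    return visited;
}
```

If this model is right, calling worker() repeatedly on one thread runs make_data() only once, but each call still pays for the wrapper call and the guard check. No lock would be needed for the check, since the guard is itself per thread.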

Can anyone explain to me what G++ does for thread_local global variables?

What is the general mechanism?
What induces the overhead?
How much overhead is involved per access? A pointer indirection? A costly lock?
Under what circumstances is there no overhead, exactly?

From the changes note I assume for example that this would not have overhead:

thread_local Data data { 1000 };

void worker() {
    for(auto &elem : data)
        elem.calculate();
}

because data is in the same translation unit?

And how does this change if worker and data are in different translation units? Is this an example of that?

// module.cpp

void worker();

thread_local Data data { 1000 };

void start() {
    worker();
}

// main.cpp

extern thread_local Data data; // correct decl?

void worker() {
    for(auto &elem : data)
        elem.calculate();
}

Does the use of data in worker now induce an overhead? Is that still the case even if it was start that kicked off worker?
