The Cost of thread_local

Question

Now that C++ is adding thread_local storage as a language feature, I'm wondering a few things:

What is the cost of thread_local likely to be?
- In memory?
- For read and write operations?
Associated with that: how do operating systems usually implement this? It would seem that anything declared thread_local has to be given its own storage space in every thread that is created.

The biggest cost is in maintainability of the code. — David Heffernan, Dec 13 '11 at 13:29

score 16 · Accepted Answer · answered Dec 13 '11 at 13:24

Storage space: size of the variable * number of threads, or possibly (sizeof(var) + sizeof(var*)) * number of threads.

There are two basic ways of implementing thread-local storage:

1. Using some sort of system call that gets information about the current kernel thread. Sloooow.
2. Using some pointer, probably in a processor register, that the kernel sets properly at every thread context switch, at the same time as all the other registers. Cheap.

On Intel platforms, variant 2 is usually implemented via a segment register (FS or GS, I don't remember which). Both GCC and MSVC support this. Access times are therefore about as fast as for global variables.
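To make the "own copy per thread" behaviour concrete, here is a minimal sketch. The names tls_counter and threads_with_private_copy are made up for illustration; the only language feature assumed is C++11 thread_local. Each worker increments its copy without any locking, and the function reports how many workers saw exactly their own increments.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread variable: every thread that touches it
// gets its own instance, initialized to 0 for that thread.
thread_local int tls_counter = 0;

// Starts `n` threads; each increments its own copy `iters` times.
// Returns how many threads ended with exactly `iters` in their copy.
// With a plain shared int, the unsynchronized increments would interfere.
int threads_with_private_copy(int n, int iters) {
    assert(n > 0 && iters >= 0);
    std::atomic<int> ok{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&ok, iters] {
            for (int k = 0; k < iters; ++k) {
                ++tls_counter;  // no lock needed: the object is not shared
            }
            if (tls_counter == iters) {
                ok.fetch_add(1);
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
    return ok.load();
}
```

Calling threads_with_private_copy(4, 1000) from the main thread should report 4, and the main thread's own tls_counter stays at 0 because the main thread never increments it.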

It is also possible, though I haven't seen it in practice yet, for this to be implemented via existing library functions like pthread_getspecific. Performance would then be like variant 1 or 2, plus library call overhead. Keep in mind that variant 2 plus library call overhead is still a lot faster than a kernel call.
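For comparison, this is roughly what a library-based per-thread variable looks like when written by hand on a POSIX system. It is only a sketch of the pthread_getspecific approach, not a claim about how any compiler implements thread_local; per_thread_int, value_seen_by_new_thread and the other helper names are invented for the example.

```cpp
#include <cassert>
#include <pthread.h>
#include <thread>

namespace {
pthread_key_t g_key;                        // one key shared by all threads
pthread_once_t g_once = PTHREAD_ONCE_INIT;  // guards one-time key creation
}

extern "C" {
// Runs at thread exit for every thread that stored a non-null value.
static void destroy_slot(void* p) { delete static_cast<int*>(p); }

static void make_key() {
    int rc = pthread_key_create(&g_key, destroy_slot);
    assert(rc == 0);
    (void)rc;
}
}

// Returns the calling thread's private int, allocating it on first use.
int& per_thread_int() {
    pthread_once(&g_once, make_key);
    void* p = pthread_getspecific(g_key);   // library call on every access
    if (p == nullptr) {
        p = new int(0);
        pthread_setspecific(g_key, p);
    }
    return *static_cast<int*>(p);
}

// Helper for demonstration: reads the value from a brand-new thread.
int value_seen_by_new_thread() {
    int seen = -1;
    std::thread t([&seen] { seen = per_thread_int(); });
    t.join();
    return seen;
}
```

If the main thread sets per_thread_int() = 7, a new thread still reads 0 from its own slot. Note that every access goes through a function call and a null check, which is the library call overhead mentioned above.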

It should be noted that the OS executable loader and dynamic linker (for shared libraries/DLLs) need specific support for variant 2. Getting TLS to work via the segment register is actually a lot of work. But it is well worth it, since the overhead of using the TLS variable is then negligible compared to a normal global. — edA-qa mort-ora-y, Dec 13 '11 at 15:09

score 12 · Answer 2 · answered Dec 13 '11 at 14:23

A description of how it works on Linux by Uli Drepper (maintainer of glibc) can be found here: www.akkadia.org/drepper/tls.pdf

The requirement to handle dynamically loaded modules etc. makes the entire mechanism a bit convoluted, which perhaps partly explains why the document weighs in at 79 pages (!).

Memory-usage-wise, each per-thread variable obviously needs its own per-thread memory (although in some cases this can be done lazily, so that the space is allocated only once the variable is first accessed), and then there are some extra data structures needed for offset tables etc.

Performance-wise, the extra cost of accessing a TLS variable mostly revolves around retrieving the address of the variable. On x86 Linux the GS register is used as the starting point for finding the current thread's data; on x86-64 it is FS. Usually there are a few pointer dereferences, plus a function call (__tls_get_addr) for dynamically loaded code. There's also the cost that creating a new thread is slower, because the implementation needs to allocate space for and possibly initialize all the TLS variables (if not done lazily).

TLS is nice for easily making some old thread-unsafe code patterns thread-safe (think errno), but for new code designed from the start for a multi-threaded world it's very seldom needed.

I disagree with that _very seldom needed_ comment. TLS is a very simple, clean, and quick way to reduce contention between threads and improve performance by avoiding repeated lookups in a thread. — edA-qa mort-ora-y, Dec 13 '11 at 15:05
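As a rough illustration of the contention argument, here is a sketch in which each thread counts events in a thread_local variable and touches the shared atomic only once, when it flushes. The names g_total, t_pending, record_event, flush_pending and count_events are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<long> g_total{0};    // shared; contended if every event hit it
thread_local long t_pending = 0; // private to each thread; no contention

// Hot path: a plain increment on thread-private memory.
inline void record_event() { ++t_pending; }

// Cold path: publish this thread's count with a single atomic operation.
inline void flush_pending() {
    g_total.fetch_add(t_pending, std::memory_order_relaxed);
    t_pending = 0;
}

// Runs `threads` workers that each record `events_per_thread` events,
// then returns the combined total.
long count_events(int threads, int events_per_thread) {
    assert(threads > 0 && events_per_thread >= 0);
    g_total.store(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i) {
        workers.emplace_back([events_per_thread] {
            for (int k = 0; k < events_per_thread; ++k) {
                record_event();
            }
            flush_pending();
        });
    }
    for (auto& t : workers) {
        t.join();  // join makes each worker's flush visible to the caller
    }
    return g_total.load();
}
```

Here count_events(4, 100000) performs four atomic additions in total rather than 400000, while still producing the same sum. How much that helps in practice depends on the workload and hardware, so it is worth measuring.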
