9

I was hoping to use the C++11 thread_local keyword for a per-thread boolean flag that is going to be accessed very frequently.

However, most compilers seem to implement thread-local storage with a table that maps integer IDs (slots) to the variable's address in the current thread. This lookup would happen inside a performance-critical code path, so I have some concerns about its performance.

The way I would have expected thread local storage to be implemented is by allocating virtual memory ranges that are backed by different physical pages depending on the thread. That way, accessing the flag would be the same cost as any other memory access, since the MMU takes care of the mapping.

Why do none of the mainstream compilers take advantage of page table mappings in this way?

I suppose I can implement my own "thread-specific page" with mmap on Linux and VirtualAlloc on Win32, but this seems like a pretty common use-case. If anyone knows of existing or better solutions, please point me to them.

I've also considered storing an std::atomic<std::thread::id> inside each object to represent the active thread, but profiling shows that the check for std::this_thread::get_id() == active_thread is quite expensive.
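
For reference, here is a simplified sketch of that approach (Object and its member names are illustrative):

#include <atomic>
#include <thread>

struct Object {
    // The thread that currently "owns" this object; a default-constructed
    // std::thread::id is used here to mean "no owner".
    std::atomic<std::thread::id> active_thread{std::thread::id{}};

    bool owned_by_current_thread() const {
        // This load plus comparison is the check that profiling flagged.
        return active_thread.load(std::memory_order_acquire) ==
               std::this_thread::get_id();
    }
};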

troniacl
  • 91
  • 3
  • 3
    If you are worried about the performance of the look-up (i.e. profiling proved so), you can cache the address using `auto& x = my_thread_local;`. –  Oct 18 '14 at 08:52
  • 1
    Caching the address is unfortunately not going to work for my use-case, as the ``thread_local`` variable is actually a flag indicating "am I the active thread for this object?" and there are many objects on which operations are performed in an asynchronous manner. So these lookups happen in deeply nested contexts all over the call-stack. – troniacl Oct 18 '14 at 08:57
  • Oh.. that sucks. Luckily, I don't use TLS but still, I wonder why it was designed that way? Sounds moronic to me:( – Martin James Oct 18 '14 at 09:01
  • 1
    All threads of the same process normally use the same mapping tables, so you cannot use mmap for TLS. – n. m. could be an AI Oct 18 '14 at 09:02
  • One of the benefits of threads is that you don't have to change mapping tables when you switch threads. If you're using thread local storage a lot, I'd strongly suggest rethinking whatever design made that seem like a good idea. (Since each thread is a contiguous flow of control, you can just copy the boolean somewhere non-thread-specific, operate on it at full speed, and copy it back when you're done.) – David Schwartz Oct 18 '14 at 09:29
  • @DavidSchwartz For context: This is in the implementation of a synchronization primitive that will be used for fine-grained parallelism in an asynchronous server. – troniacl Oct 18 '14 at 09:49
  • 3
    Just wondering, but there is not just one flag per thread but there are objects where every object has such a flag. Since you mention that using the thread ID works instead, I wonder firstly who sets these thread-specific flags and secondly how you sync access to them. Maybe taking a step back and describing the underlying problem is a better approach than asking how to implement a specific (and possibly flawed) solution to it... – Ulrich Eckhardt Oct 18 '14 at 10:00
  • Maybe consider changing the thread local "am I the active thread for this object" flag to a non-thread local global variable that holds a thread ID indicating "which thread is the active thread for this object" that is updated atomically? Then reads of the active thread ID for the object will just be plain old reads from the data segment/section. – Michael Burr Oct 18 '14 at 10:02
  • I agree with Ulrich Eckhardt. Explain what the boolean does -- there almost certainly is a better way. – David Schwartz Oct 18 '14 at 10:02
  • This should also serve as a yet another example of why global variables are bad. Try more functional or OOP architecture without globals... – hyde Oct 18 '14 at 10:05
  • Are you talking about PIC or "regular" code? – Marc Glisse Oct 18 '14 at 10:30
  • @UlrichEckhardt When an operation should be performed on an object, its ``active_thread`` ID is first atomically compared to the current thread's ID. If they match, no synchronization is necessary because the thread already acquired exclusivity on that object. Otherwise, a work task is pushed into a producer-side wait-free queue which will be processed by the active thread before it relinquishes control of the object. – troniacl Oct 18 '14 at 10:33
  • @MichaelBurr It's a per-object flag, so it cannot be changed into a global. Also, the atomic thread ID solution does not use TLS. – troniacl Oct 18 '14 at 10:35
  • @Marc Glisse: Regular code. – troniacl Oct 18 '14 at 10:36
  • @hyde Not sure where you see globals, unless you are referring to Michael Burr's comment. – troniacl Oct 18 '14 at 10:36
  • @troniacl "Thread local storage" implies static storage duration, does it not? So if not necessarily a "global variable", then at least "global state", leading to side effects and re-entrancy problems even in the context of single thread. – hyde Oct 18 '14 at 11:15
  • @hyde I see what you mean, and maybe I muddied things a bit with the ``thread_local`` specifier, since that can only be applied to global/static variables. That is actually yet another problem with ``thread_local``, because in my case the flags need to be per-object. Fortunately ``pthread_key_create`` and its friends can be used to get non-global thread-local storage, but at the cost of internally looking the address up in a table. – troniacl Oct 18 '14 at 11:27
  • @troniacl: if it's a per-object flag then how can it be thread local? And if it's a per object flag already, do what Ulrich Eckhardt suggested and make the flag a thread ID. – Michael Burr Oct 18 '14 at 17:04

6 Answers

6

On Linux/x86-64, thread-local storage is implemented through a special segment register, %fs (per the x86-64 ABI, page 23...)

So the following code (I'm using C with the GCC extension __thread syntax, but it is the same as C++11 thread_local)

__thread int x;
int f(void) { return x; }

is compiled (with gcc -O -fverbose-asm -S) into:

         .text
 .Ltext0:
         .globl  f
         .type   f, @function
 f:
 .LFB0:
         .file 1 "tl.c"
         .loc 1 3 0
         .cfi_startproc
         .loc 1 3 0
         movl    %fs:x@tpoff, %eax       # x,
         ret
         .cfi_endproc
 .LFE0:
         .size   f, .-f
         .globl  x
         .section        .tbss,"awT",@nobits
         .align 4
         .type   x, @object
         .size   x, 4
 x:
         .zero   4

Therefore, contrary to your fears, access to TLS is really quick on Linux/x86-64. It is not exactly implemented as a table (instead, the kernel & runtime manage the %fs segment register to point to a thread-specific memory zone, and the compiler & linker manage the offset into it). However, the old pthread_getspecific did indeed go through a table, but it is nearly useless once you have TLS.
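
For illustration, here is a minimal sketch of that older pthread-specific-data interface (written as C++ to match the question; error checking omitted, and my_flag is just an illustrative name). Every call goes through a key lookup instead of a simple %fs-relative access:

#include <pthread.h>

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void destroy_flag(void *p) { delete static_cast<int *>(p); }
static void make_key() { pthread_key_create(&key, destroy_flag); }

// Returns a reference to the calling thread's flag, creating it on first use.
int &my_flag() {
    pthread_once(&key_once, make_key);
    // pthread_getspecific is the per-key lookup mentioned above.
    int *p = static_cast<int *>(pthread_getspecific(key));
    if (!p) {
        p = new int(0);
        pthread_setspecific(key, p);
    }
    return *p;
}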

BTW, by definition, all threads in the same process share the same address space in virtual memory, since a process has its own single address space. (see /proc/self/maps etc... see proc(5) for more about /proc/, and also mmap(2); the C++11 thread library is based on pthreads which are implemented using clone(2)). So "thread-specific memory mapping" is a contradiction: once a task (the thing which is run by the kernel scheduler) has its own address space, it is called a process (not a thread). The defining characteristic of threads in the same process is to share a common address space (and some other entities, like file descriptors).

Matthew Cole
  • 602
  • 5
  • 21
Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
  • 1
    As long as you are not compiling with `-fPIC` (or equivalent) and your `thread_local` object does not have a non-trivial constructor or destructor (gcc doesn't optimize properly), TLS has roughly the same speed as a plain global variable. – Marc Glisse Oct 18 '14 at 10:44
  • Excellent answer! And a big +1 for the last paragraph; I don't know what OP meant by "thread-specific page mapping" either. – Quuxplusone Jul 05 '17 at 19:46
  • That is true for application/static library, but not true for shared libraries. – Bogdan Mart Dec 21 '21 at 15:20
4

The suggestion doesn't work, because it would prevent other threads from accessing your thread_local variables via a pointer. Those threads would end up accessing their own copy of that variable.

Say for example that you have a main thread and 100 worker threads. The worker threads pass pointers to their own thread_local variables back to the main thread. The main thread now has 100 pointers to those 100 variables. If the TLS memory were page-table mapped as suggested, the main thread would have 100 identical pointers to a single, uninitialized variable in the TLS of the main thread - certainly not what was intended!
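
Here is a minimal illustration of that kind of sharing, which works today precisely because TLS is not mapped per-thread in the page tables (the promise/future plumbing just keeps the worker alive while main reads; the names are mine):

#include <future>
#include <iostream>
#include <thread>

thread_local int tls_counter = 0;

int main() {
    std::promise<int *> addr_promise;   // worker -> main: address of its copy
    std::promise<void>  done_promise;   // main -> worker: you may exit now
    auto addr_future = addr_promise.get_future();
    auto done_future = done_promise.get_future();

    std::thread worker([&] {
        tls_counter = 42;                      // the worker's own copy
        addr_promise.set_value(&tls_counter);  // publish its address
        done_future.wait();                    // keep the thread (and its TLS) alive
    });

    int *worker_tls = addr_future.get();
    // With the real implementation (one shared address space, %fs-relative
    // offsets resolving to ordinary addresses) this prints 42. With per-thread
    // page mappings, main would instead see its own copy of tls_counter.
    std::cout << *worker_tls << "\n";

    done_promise.set_value();
    worker.join();
}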

MSalters
  • 173,980
  • 10
  • 155
  • 350
2

Memory-mappings are not per-thread but per-process. All threads would share the same mapping.

The kernel could offer per-thread mappings but it presently does not.

usr
  • 168,620
  • 35
  • 240
  • 369
  • 1
    No, the kernel could not offer per-thread mappings, since by definition memory mapping is specific to & characteristic of processes, not threads. – Basile Starynkevitch Oct 18 '14 at 09:22
  • 3
    @BasileStarynkevitch: That seems a rather odd definition. Do you have an authoritative source for that? Either POSIX or any OS documentation? – MSalters Oct 18 '14 at 09:27
  • 2
    Each CPU core has a register that determines the set of page tables loaded. That register can be set to a different value per thread by the OS. The distinction of threads and processes is not one that the hardware knows about. It is an OS concept. You are referring to pthreads documentation. pthreads could be defined differently. – usr Oct 18 '14 at 09:32
  • I agree that threads & processes are a software only concept. BTW, the Linux kernel scheduler is scheduling tasks, which can be a single-threaded process, or a thread (or some kernel thread like `kswapd`) – Basile Starynkevitch Oct 18 '14 at 09:35
2

Mainstream operating systems like Linux, OS X and Windows make page mapping a per-process property, not a per-thread one. There is a very good reason for that: the page mapping tables are stored in RAM, and reading them to calculate the effective physical address would be excessively expensive if it had to be done for every instruction.

So the processor doesn't; it keeps a copy of the recently used mapping table entries in fast memory close to the execution core, called the TLB (translation lookaside buffer).

Invalidating the TLB is very expensive: it has to be reloaded from RAM, with low odds that the data is available in one of the memory caches. The processor can be stalled for thousands of cycles when this happens.

So your proposed scheme is in fact likely to be very inefficient, assuming an operating system would even support it; an indexed lookup is cheaper. Processors are very good at simple math, which happens at gigahertz rates, while accessing memory effectively happens at megahertz rates.

Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536
  • 1
    I'm using one thread per core, with each thread being pinned to its assigned core. I would allocate a large enough number of "thread-local pages" at startup, so there is no reason why the TLB would have to be cleared during normal operation. – troniacl Oct 18 '14 at 10:28
  • Well, it is academic, your OS doesn't support it. I explained why, do keep in mind that multicore processors were very rare in the early 1990s :) – Hans Passant Oct 18 '14 at 10:36
  • On some platforms, the TLB can't even fetch through a cache (usually because hardware consistency isn't implemented for the TLB). – rsaxvc Jul 02 '17 at 14:19
0

You are using C++. Have a thread object per thread, with the thread's working procedure and all (or most) functions called by it being member functions of that object. Then the thread ID, or any other thread-specific data, can simply be member variables.
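
A rough sketch of what I mean (Worker and its members are made-up names):

#include <thread>

class Worker {
public:
    void start() { thread_ = std::thread(&Worker::run, this); }
    void join()  { thread_.join(); }

private:
    void run() {
        // Everything called from here can be a member function, so
        // per-thread state is just ordinary member data of this object.
        id_ = std::this_thread::get_id();
        do_work();
    }

    void do_work() {
        // ... read id_ or any other thread-specific members here ...
    }

    std::thread::id id_;   // thread-specific data without TLS
    std::thread thread_;
};

Usage is then simply: Worker w; w.start(); ... ; w.join();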

n. m. could be an AI
  • 112,515
  • 14
  • 128
  • 243
0

One contemporary concern is hardware constraints (though I'm sure this issue predates the situations below).

On SPARC T5 processors, each hardware thread has its own MMU, but shares a TLB with up to seven sibling threads on the same core, and that TLB can get thrashed pretty hard.

On MIPS, different memory mappings for threads can force them to be serialized onto a single virtual thread execution context, because hardware thread contexts share an MMU. The kernel already cannot run multiple processes on neighboring thread contexts, and separate memory mappings per thread would hit the same limitation.

rsaxvc
  • 1,675
  • 13
  • 20