39

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.

ks1322
dsimcha
  • Are you experiencing this on Windows? I don't know how D implements TLS exactly, but Visual C++ seems to use the FS register, which is quite slow to read from. – user541686 May 10 '13 at 22:04

6 Answers

39

The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.

Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.

If the scheduler does not support any of these methods, the compiler/library has to do the following:

  • Get the current ThreadId
  • Take a semaphore
  • Look up the pointer to the TLS block by ThreadId (it may use a map or similar)
  • Release the semaphore
  • Return that pointer

Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.

The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and thus allocating a new TLS block and modifying the data structure).

Unfortunately it's not uncommon to see the slow TLS implementation in practice.

Nils Pipenbrinck
    The slow version might only be needed once per stack frame, as the pointer could be cached and reused (the stack is TLS after a fashion) – BCS Feb 10 '09 at 19:03
  • Keeping a thread list in userspace is an utterly horrible way to implement threads... – R.. GitHub STOP HELPING ICE Mar 27 '11 at 19:08
  • Any scheme that can produce a threadID, can produce a TLS pointer in an equally short period of time (if not at the same time!), so I don't understand why semaphores are needed. – Ira Baxter Jul 03 '12 at 04:29
    @IraBaxter: because you might need more than one TLS pointer ... one TLS pointer per shared library. (That aside, the semaphores might be implemented in interlocked atomic instructions.) – rwong Feb 14 '13 at 17:38
13

Thread locals in D are really fast. Here are my tests.

64-bit Ubuntu, Core i5, dmd v2.052. Compiler options: dmd -O -release -inline -m64

// this loop takes 0m0.630s
void main(){
    int a; // register allocated
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread-local accesses. Thread locals are accessed via the %fs register, so only a couple of processor instructions are involved:

Disassembling with objdump -d:

- this is the local variable in the %ecx register (loop counter in %eax):
   8:   31 c9                   xor    %ecx,%ecx
   a:   b8 00 ca 9a 3b          mov    $0x3b9aca00,%eax
   f:   83 c1 09                add    $0x9,%ecx
  12:   ff c8                   dec    %eax
  14:   85 c0                   test   %eax,%eax
  16:   75 f7                   jne    f <_Dmain+0xf>

- this is the thread local; the %fs register is used for indirection, %edx is the loop counter:
   6:   ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
   b:   64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
  12:   00 00 
  14:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # 1b <_Dmain+0x1b>
  1b:   83 04 08 09             addl   $0x9,(%rax,%rcx,1)
  1f:   ff ca                   dec    %edx
  21:   85 d2                   test   %edx,%edx
  23:   75 e6                   jne    b <_Dmain+0xb>

Maybe the compiler could be even cleverer and cache the thread local in a register before the loop and write it back at the end (it would be interesting to compare with the gdc compiler), but even as it stands the numbers are very good IMHO.

Andriy
9

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.

To see what kind of code is generated for tls, compile and obj2asm this code:

__thread int x;
int foo() { return x; }

TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.

I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.

thkala
Walter Bright
  • Can you expand this comment with respect to the top 'accepted' answer? The claim there is that it can be very fast, as-fast or faster than an indirection if the OS/MMU are involved. Where does D stand in terms of such an implementation? – Manu Evans Apr 05 '19 at 17:35
5

I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.

Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.

On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.

I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.

supercat
4

If you can't use compiler TLS support, you can manage TLS yourself. I built a wrapper template for C++, so it is easy to replace the underlying implementation. In this example, I've implemented it for Win32. Note: since you cannot obtain an unlimited number of TLS indices per process (at least under Win32), you should point to heap blocks large enough to hold all thread-specific data. This way you minimize the number of TLS indices and related queries. In the "best case", you'd have just one TLS pointer pointing to one private heap block per thread.

In a nutshell: Don't point to single objects, instead point to thread specific, heap memory/containers holding object pointers to achieve better performance.

Don't forget to free memory if it isn't used again. I do this by wrapping a thread into a class (like Java does) and handling TLS in the constructor and destructor. Furthermore, I store frequently used data like thread handles and IDs as class members.

usage:

for type*: tl_ptr<type>

for const type*: tl_ptr<const type>

for type* const: const tl_ptr<type>

const type* const: const tl_ptr<const type>

#include <windows.h>
#include <cassert>

template<typename T>
class tl_ptr {
protected:
    DWORD index;
    tl_ptr(const tl_ptr&); // not implemented: copying would TlsFree the same index twice
public:
    tl_ptr(void) : index(TlsAlloc()){
        assert(index != TLS_OUT_OF_INDEXES);
        set(NULL);
    }
    void set(T* ptr){
        TlsSetValue(index,(LPVOID) ptr);
    }
    T* get(void)const {
        return (T*) TlsGetValue(index);
    }
    tl_ptr& operator=(T* ptr){
        set(ptr);
        return *this;
    }
    tl_ptr& operator=(const tl_ptr& other){
        set(other.get());
        return *this;
    }
    T& operator*(void)const{
        return *get();
    }
    T* operator->(void)const{
        return get();
    }
    ~tl_ptr(){
        TlsFree(index);
    }
};
sam
  • This question is about D, not C++, and you don't address any of the questions asked by the OP. –  Mar 20 '10 at 11:19
  • True, I added some more about the concept I've been following so far (imho, the language doesn't really matter). – sam Mar 20 '10 at 11:34
    In Sam's support, the OP specifically asked for solutions other than D's solution, too. – Ira Baxter Mar 04 '12 at 19:30
2

We have seen similar performance issues with TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try to improve on this.

I'm pleased to say that we now have a small API that offers a >50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread-id, and a >65% reduction if the calling thread has already obtained its thread-id (perhaps for some earlier processing step).

The new function ( get_thread_private_ptr() ) always returns a pointer to a struct we use internally to hold all sorts of things, so we only need one per thread.

All in all, I think the Win32 TLS support is really poorly crafted.

Hugh
    Not going to down vote, but this anecdotal evidence is not too helpful. You didn't mention what Windows version you tested on and that could make a great difference in the numbers. Neither you said using what technique you improved upon that with your get_thread_private_ptr method. Other things that could contribute to the performance is what language/compiler you used or if you just used the TLS API (TlsAlloc, TlsGetValue, TlsSetValue, and TlsFree). They do work differently and have different performance characteristics. – Filip Navara Mar 20 '18 at 20:57