31

How is __thread in gcc implemented? Is it simply a wrapper over pthread_getspecific and pthread_setspecific?

With my program that uses the posix API for TLS, I'm kind of disappointed now seeing that 30% of my program runtime is spent on pthread_getspecific. I called it on the entry of each function call that needs the resource. The compiler doesn't seem to optimize out pthread_getspecific after inlining optimization. So after the functions are inlined the code is basically searching for the correct TLS pointer again and again to get the same pointer returned.

Will __thread help me in this situation? I know that there is thread_local in C11, but the gcc I have doesn't support it yet. (But now I see that my gcc does support _Thread_local just not the macro.)

I know I can simply test it and see. But I have to go somewhere else now, and I'd like to know better on a feature before I attempt a quite big rewrite.

red0ct
  • `__thread` is implemented differently on different platforms, on some (you didn't tell us which one you are programming for), it might be implemented with `pthread_getspecific`. – fuz Aug 27 '15 at 10:36
  • Please give us more information! I'd really like to solve your problem but right now I don't know enough about what platform you use / how you compile your code to be able to give you an answer as to how to make thread local storage perform better. – fuz Aug 27 '15 at 10:42

2 Answers

19

Recent GCC versions (e.g. GCC 5) support C11 and its thread_local (when compiling with e.g. gcc -std=c11). As FUZxxl commented, instead of C11 thread_local you could use the __thread qualifier, which older GCC versions also support. Read about Thread Local Storage.

pthread_getspecific is indeed comparatively slow, since it involves a function call (it belongs to the POSIX threads library, so it is provided not by GCC but by the C library, e.g. GNU glibc or musl-libc). Using thread_local variables will very probably be faster.

Look into the source code of MUSL's thread/pthread_getspecific.c file for an example of implementation. Read this answer to a related question.

And __thread & thread_local are (often) not magically translated to calls to pthread_getspecific. They usually involve some specific addressing mode and/or register (the details are implementation-specific and tied to the ABI; on Linux, I guess that since x86-64 has more registers & addressing modes, its implementation of TLS is faster than on i386), with help from the compiler, the linker and the runtime system. Conversely, it could happen that some implementations of pthread_getspecific use internal thread_local variables (in your implementation of POSIX threads).
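To illustrate the semantics described above, here is a minimal sketch, assuming GCC or Clang on a POSIX system. The names `counter`, `worker` and `run_two_threads` are made up for this illustration. The point is only that each thread reads and writes its own copy of a `__thread` variable, with no key and no lookup call in the source.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own zero-initialized copy of this variable. */
static __thread int counter;

/* Hypothetical worker: bumps its own copy n times and reports the total. */
static void *worker(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        counter++;              /* plain access to this thread's copy */
    *(int *)arg = counter;      /* hand the per-thread total back */
    return NULL;
}

/* Runs two workers with different iteration counts; returns 0 on success. */
int run_two_threads(int *a, int *b) {
    pthread_t ta, tb;
    if (pthread_create(&ta, NULL, worker, a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

If `counter` were an ordinary global, the two totals could interfere; with `__thread` each worker should report exactly its own iteration count.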

As an example, compiling the following code

#include <pthread.h>

extern const pthread_key_t key;

__thread int data;

int
get_data (void) {
  return data;
}

int
get_by_key (void) {
  return *(int*) (pthread_getspecific (key));
}

using GCC 5.2 (on Debian/Sid) with gcc -m32 -S -O2 -fverbose-asm gives the following code for get_data using TLS:

  .type get_data, @function
get_data:
.LFB3:
  .cfi_startproc
  movl  %gs:data@ntpoff, %eax   # data,
  ret
  .cfi_endproc

and the following code of get_by_key with an explicit call to pthread_getspecific:

get_by_key:
.LFB4:
  .cfi_startproc
  subl  $24, %esp   #,
  .cfi_def_cfa_offset 28
  pushl key # key
  .cfi_def_cfa_offset 32
  call  pthread_getspecific #
  movl  (%eax), %eax    # MEM[(int *)_4], MEM[(int *)_4]
  addl  $28, %esp   #,
  .cfi_def_cfa_offset 4
  ret
  .cfi_endproc

Hence using TLS with __thread (or thread_local in C11) should be faster than using pthread_getspecific: the access compiles to a single segment-relative load, avoiding the overhead of a call.
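As a rough sketch of what such a rewrite could look like, the following puts the two styles side by side. It assumes GCC or Clang on a POSIX system; `struct resource`, `get_resource_posix` and `get_resource_tls` are hypothetical names standing in for whatever the program actually stores per thread.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread resource; stands in for whatever the key stored. */
struct resource { int hits; };

/* Before: every access pays for a pthread_getspecific call. */
static pthread_key_t res_key;
static pthread_once_t res_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&res_key, free); }

struct resource *get_resource_posix(void) {
    pthread_once(&res_once, make_key);
    struct resource *r = pthread_getspecific(res_key);
    if (r == NULL) {                      /* first use in this thread */
        r = calloc(1, sizeof *r);
        if (r != NULL && pthread_setspecific(res_key, r) != 0) {
            free(r);
            r = NULL;
        }
    }
    return r;
}

/* After: the resource itself lives in compiler-managed TLS. */
static __thread struct resource res_tls;

struct resource *get_resource_tls(void) {
    return &res_tls;                      /* no key, no lookup call */
}
```

Callers keep the same pointer-returning interface, so the function bodies that use the resource would not need to change. Note that a `__thread` object has no destructor hook comparable to the one passed to `pthread_key_create`, so any cleanup has to be arranged separately.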

Notice that thread_local is a convenience macro defined in <threads.h> (a C11 standard header).
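When the macro is unavailable but the compiler knows the keyword, one possible workaround is to pick the specifier yourself. The sketch below assumes GCC or a compatible compiler; `TLS_VAR`, `call_count` and `count_call` are names invented for this example, not part of any standard header.

```c
/* TLS_VAR is a hypothetical helper macro, not part of any standard header. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define TLS_VAR _Thread_local      /* C11 keyword, no header required */
#elif defined(__GNUC__)
#  define TLS_VAR __thread           /* GNU extension for older modes */
#else
#  error "no thread-local storage specifier known for this compiler"
#endif

static TLS_VAR unsigned call_count;

/* Returns how many times the calling thread has invoked this function. */
unsigned count_call(void) {
    return ++call_count;
}
```

This avoids depending on `<threads.h>`, which some C libraries do not ship even when the compiler accepts `_Thread_local`.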

Basile Starynkevitch
  • Does pthread_setspecific do more extra work than the built-in TLS? –  Aug 27 '15 at 10:28
  • `__thread` is a pre-C11 extension of gcc that has the same semantics as C11's `_Thread_local`; in fact it guarantees a bit more than `_Thread_local`. `pthread_getspecific` does not necessarily involve a function call, it can be implemented as a macro. – fuz Aug 27 '15 at 10:30
  • @FUZxxl: it could be implemented by a macro (but I guess that the standard requires that you might use it thru a function pointer), but it usually is not implemented as a macro – Basile Starynkevitch Aug 27 '15 at 10:31
  • @BasileStarynkevitch Standard says it's allowed to be macro, and it's weird that the glibc doesn't implement it as such. – fuz Aug 27 '15 at 10:34
  • @xiver77 It is not specified how `__thread` and `_Thread_local` (i.e. built-in TLS) are implemented. An implementation could very well use `pthread_getspecific` and `pthread_setspecific` to implement it, although that's not the case on usual UNIX-like operating systems. – fuz Aug 27 '15 at 10:39
  • TLS is implemented both on i386 and amd64 Linux with a segment register (`%fs` for i386, `%gs` for amd64). The speed difference is negligible. – fuz Aug 27 '15 at 10:49
4

gcc's __thread has exactly the same semantics as C11's _Thread_local. You don't tell us what platform you are programming for, and the implementation details vary between platforms. For example, on x86 Linux, gcc should compile accesses to thread-local variables as memory instructions with a segment prefix (%gs on i386, %fs on x86-64) instead of invoking pthread_getspecific.

fuz
  • I'm using an intel cpu. So you mean gcc uses a special register like the stack pointer register but dedicated to the TLS? Does pthread_getspecific do the same thing? –  Aug 27 '15 at 10:26
  • @xiver77 “i'm using an intel cpu” is not enough information. What operating system and architecture are you programming for? Intel makes CPUs with many different architectures. On i386 platforms where the ABI supports this, the `%fp` segment register is set to a non-zero base address that points to the thread's thread-local data. I can't tell you if gcc can do this on your platform as you don't give me enough information. Could you also give me the version of gcc, the invocation of gcc and assembly output (use the `-S` switch)? – fuz Aug 27 '15 at 10:28
  • Sorry for a late reply. My platform is Ubuntu 15.10 i386 gcc 4.9.2. I'll also check and see the assembly output for `__thread` right now. –  Aug 27 '15 at 10:46
  • @xiver77 How do you invoke gcc? On i386 Linux, gcc should compile access to `__thread` variables without invoking `pthread_getspecific`. Either another part of your code invokes `pthread_getspecific` or something weird happens. – fuz Aug 27 '15 at 10:49
  • @xiver77 This assembly does not invoke `pthread_getspecific` at all. I guess the calls come from somewhere else. – fuz Aug 27 '15 at 10:54
  • Also, it's funny that the compiler generates `%gs` relative accesses whereas it should generate `%fs` relative accesses. How do you invoke gcc? – fuz Aug 27 '15 at 10:55
  • I think it's because I've installed a 32bit Linux. BTW you may have misunderstood my question a little. I did explicitly call `pthread_getspecific` in my original program. That's why my second paragraph is there. I'll edit my question to prevent confusion. –  Aug 27 '15 at 10:57
  • @xiver77 Ah, I see. Sorry for that. In this case, I can gladly say: yes, `__thread` will give you a decent performance increase. – fuz Aug 27 '15 at 10:59