I'm working on a runtime library that does user-level context switching (using Boost.Context), and I'm having trouble using thread_local variables. Consider the following (reduced) code:

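#include <iostream>
using std::cout;

void UserLevelContextSwitch();  // provided by the runtime library

thread_local int* volatile tli;

int main()
{
    tli = new int(1);   // part 1, done by thread 1
    UserLevelContextSwitch();
    int li = *tli;      // part 2, done by thread 2
    cout << li;
}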

Since there are two accesses to the thread_local variable, the main function is transformed by the compiler to something along these lines (reversed from assembly):

register int* volatile* ptli = &tli; // cache address of thread_local variable
*ptli = new int(1);
UserLevelContextSwitch();
int li = **ptli;
cout << li;

This seems to be a legal optimization, since the value of volatile tli is not being cached in a register. But the address of the volatile tli is in fact being cached, and not read from memory on part 2.

And that's the problem: after the user-level context switch, the thread that did part 1 goes somewhere else. Part 2 is then picked up by some other thread, which gets the previous stack and registers state. But now the thread that's executing part 2 reads the value of the tli that belongs to thread 1.

I'm trying to figure out a way to prevent the compiler from caching the thread-local variable's address, and volatile doesn't go deep enough. Is there any trick (preferably standard, possibly GCC-specific) to prevent the caching of the thread-local variables' addresses?

Eran
  • I may be dense here, but what good does `thread_local` buy you if you are doing your own threading? – 500 - Internal Server Error Sep 04 '14 at 19:56
  • @Mgetz, aren't thread-local variables inherently atomic?... Your suggestion might happen to work if the compiler gets over-conservative with atomics, but it shouldn't really have a problem caching the *addresses* of the atomic variables just as it does with the non-atomic variables. – Eran Sep 04 '14 at 19:59
  • C++11's threading model assumes that a thread of execution will start at the top of a function then follow a normal calling sequence - it doesn't support a thread 'starting' in the middle of a function (C++11 1.10/1 "Multi-threaded executions and data races"). That's not to say you won't be able to solve your problem, but I doubt there will be a standard way to do it. Excellent question, though. – Michael Burr Sep 04 '14 at 20:05
  • @500-InternalServerError, my library manages tasks, which can migrate from OS thread to OS thread. I need a way to make some own identity stuff available to each running task. I manage the setup of the identity when a task is mapped to a thread, but due to this issue the identity stuff might be mapped to the wrong thread. Moving the identity stuff around as arguments to each call is undesirable, because that will affect user code. – Eran Sep 04 '14 at 20:05
  • @eran: I understand your issue here, but won't the actual _value_ of the thread local follow the hardware thread (for lack of a better term)? – 500 - Internal Server Error Sep 04 '14 at 20:13
  • @500-InternalServerError, the value is obtained by reading an address, and each thread's variable is located on [a different address](http://coliru.stacked-crooked.com/a/860d6d215219feb2). By moving the addresses around, one thread can easily access TLS variables of another one. – Eran Sep 04 '14 at 20:41
  • `volatile` is basically always wrong. [See this](https://lkml.org/lkml/2007/5/8/372) (most of it is applicable to userspace too). It is also not clear what you mean by "volatile address". The address of a variable is normally a constant. It is not "read from memory", it just is. The address of `tli` could be 42. What does it mean to read 42 from memory? If there's a meaning, how can the result be different each time? Shouldn't it always be 42? – n. m. could be an AI Sep 04 '14 at 22:03
  • @n.m., using `volatile` with `setjmp`/`longjmp` is common, and, IIRC, it is the only scenario in which the effect of `volatile` is testable within *standard* code. See [this](http://stackoverflow.com/questions/7996825/why-volatile-works-for-setjmp-longjmp) for more. As for the memory address changing, that's exactly the problem with thread-local variables: if two threads access one such variable, each "copy" of the variable will have a different address. The addresses are not constants, and are obtained by calling special functions such as `__emutls_get_address` - hence the address caching. – Eran Sep 05 '14 at 06:46
  • The linked rant says one can use `volatile` for *specific* accesses when registers should be flushed (i.e. after `longjmp`, or after your context switch). Basically it says `volatile` is a wrong design choice. It should have been an action-like compiler directive ("flush registers at this point"). Well I guess if you don't need every last drop of performance, you can use `volatile` in the declaration. – n. m. could be an AI Sep 05 '14 at 07:53
  • As for the address being constant, it is a constant in each thread. The compiler is free to implement thread-local variables as e.g. `__threads[__current_thread_id]->thread_local_memory[__variable_offset]`. `__variable_offset` is a constant, `__current_thread_id` is normally too. You want the latter to be in effect volatile. But obtaining the current thread id could be very expensive and the compiler has no reason to do this for every volatile access. Maybe you can fool the compiler by having `&tli` returned by a function in a separate translation unit, with whole-program optimizations off. – n. m. could be an AI Sep 05 '14 at 08:02

1 Answer

There is no way to pair user-level context switches with TLS. Even with atomics and a full memory fence, caching the address seems to be a legitimate optimization, since the thread_local variable is a file-scope static variable that the compiler assumes cannot move. (Some compilers might still be sensitive to compiler memory barriers such as std::atomic_thread_fence and asm volatile ("" : : : "memory");, though.)

The Cilk runtime uses the same technique you describe to implement "continuation stealing", where a different thread can continue execution after a sync point. Its developers explicitly discourage using TLS in a Cilk program. Instead, they recommend "hyperobjects", a special Cilk feature that substitutes for TLS (and also provides serial/deterministic join semantics). See also the Cilk developer presentation about thread_local and parallelism.

Also, Windows provides FLS (Fiber Local Storage) as a TLS replacement when Fibers (the same lightweight context switches) are in use.

Anton
  • Interesting. I dug into the Cilk runtime code a bit, and turns out they do use TLS [internally](https://github.com/iu-parfunc/cilk_releases/blob/b8eb1c0f36231269f185970bf8d3c911c1dd8b1b/runtime/os-unix.c) to get the descriptor of the current task. But the use of TLS is separated from user code by abstractions such as hyperobjects (and, actually, a library). My project lacks compiler support yet, so user code is cluttered with code that should be generated by the compiler. It might hurt performance, but I'll try separating TLS commands from actual use. – Eran Sep 05 '14 at 13:16
  • If you rely on a particular compiler (gcc), I'd look into the TLS ABI and try to force the compiler to recalculate the address (put GS into the clobber list? Replace the address by hand?). – Anton Sep 05 '14 at 13:46
  • I was hoping TLS would have some low-level API I could use (intrinsics), but I couldn't find any, and the generated assembly doesn't seem easy to duplicate (too many "hardcoded" offsets). I'll start with using accessors to the TL variables, and define them as `noinline`. Hopefully, that will prevent the compiler from omitting calls. – Eran Sep 05 '14 at 14:22
  • There is an API, but only for the slow path. See the TLS ABI [here](http://people.redhat.com/drepper/tls.pdf). – Anton Sep 05 '14 at 16:51