As long as you avoid false sharing, you're fine: make sure that the static data used by one thread isn't in the same cache line as static data used by another thread. You could look for this by checking a perf event like machine_clears.memory_ordering (on Intel CPUs).
If you find that your program has some false sharing, you could just rearrange the order of declarations (since compilers tend to store things in the order they're declared), or use linker sections to choose how things are grouped in static storage. A struct or array would also give you guarantees about memory layout.
TL;DR: avoid putting two variables in the same cache line (often 64B) if those variables will be used by different threads. Of course, group things together that are modified at the same time from the same thread.
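For illustration, here's a minimal sketch of the struct/array approach (the 64-byte figure and the PaddedCounter name are assumptions for this example; C++17's std::hardware_destructive_interference_size reports the target's value where available):

    #include <atomic>
    #include <cstddef>

    // Assumed cache-line size; check your target, or use
    // std::hardware_destructive_interference_size (C++17).
    constexpr std::size_t kCacheLine = 64;

    // alignas pads each struct out to a full cache line, so adjacent
    // array elements can't false-share.
    struct alignas(kCacheLine) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[4];  // one per thread, each in its own line

    void work(int tid) {
        counters[tid].value.fetch_add(1, std::memory_order_relaxed);
    }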
Thread-local variables solve a different problem. They let the same function access a different static variable depending on which thread called it. This is an alternative to passing around pointers.
They're still stored in memory just like other static/global variables. You can be sure there's no false sharing, but there are cheaper ways to avoid that.
The difference between thread-local vars and "normal" globals is in how they're addressed. Instead of being accessed through an absolute address, they're accessed at an offset within a per-thread storage block.
On x86, this is done with segment-override prefixes, e.g. mov rax, QWORD PTR fs:0x28 loads from byte 0x28 inside the thread-local storage block (since each thread's fs segment base is set to the address of its own TLS block).
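For example (a sketch with a made-up variable name; the exact addressing depends on the compiler and which TLS model it picks):

    thread_local int tls_counter = 0;  // one instance per thread

    int bump() {
        // On x86-64 Linux, GCC/clang typically compile this access to
        // something like
        //     mov eax, DWORD PTR fs:tls_counter@tpoff
        // i.e. an fs-relative load instead of an absolute address.
        return ++tls_counter;
    }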
So TLS isn't free. Don't use it if you don't need it. It can be cheaper than passing pointers around, though.
There's no way to let the hardware skip cache-coherency checks, because hardware doesn't have any notion of TLS. There are just stores and loads to/from memory, and the ordering guarantees provided by the ISA. Since TLS is just a trick for getting the same function to use different addresses for different callers, a software bug in implementing TLS could result in stores to the same address. The hardware doesn't let buggy software break its cache coherency this way, since that would potentially break privilege separation.
On weakly-ordered architectures, memory_order_consume is (in theory) a way to arrange inter-thread data dependencies so that other threads only have to wait for writes to the shared data, not for writes to thread-private data.
However, this is too hard for compilers to safely and reliably get right, so they currently implement mo_consume as the stronger mo_acquire. I wrote a really long and rambling answer a while ago with a bunch of links to memory-ordering stuff, and a mention of C++11 memory_order_consume.
It's so hard to standardize because different architectures have different rules for what operations carry a dependency. I assume some code-bases use hand-written asm that takes advantage of dependency ordering; AFAIK, that's the only way to exploit it to avoid memory-barrier instructions on a weakly-ordered ISA (e.g. in a producer-consumer model, or in lockless algorithms that need more than just unordered atomic stores).
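To make the consume idea concrete, here's a sketch of pointer-publication in C++ (Payload and published are made-up names; and since current compilers promote the consume load to acquire, portable C++ doesn't actually get the barrier savings):

    #include <atomic>

    struct Payload { int data; };
    std::atomic<Payload*> published{nullptr};

    void producer() {
        Payload* p = new Payload{42};
        // Release store: the write to p->data is ordered before the
        // pointer becomes visible to other threads.
        published.store(p, std::memory_order_release);
    }

    int consumer() {
        Payload* p;
        while (!(p = published.load(std::memory_order_consume))) {}
        // This load carries a data dependency on the pointer load; that
        // dependency is what consume ordering (and hardware dependency
        // ordering on ARM / PowerPC) is supposed to let you rely on
        // without a barrier instruction.
        return p->data;
    }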