As long as you avoid false sharing, you're fine: make sure that the static data used by one thread isn't in the same cache line as static data used by another thread. You could look for this by checking a perf event like machine_clears.memory_ordering (on Intel CPUs).
If you find that your program has some false sharing, you could just rearrange the order of declarations (since compilers tend to store things in the order they're declared), or use linker sections to choose how things are grouped in static storage. A struct or array would also give you guarantees about memory layout.
TL;DR: avoid putting two variables in the same cache line (often 64B) if those variables will be used by different threads. Of course, group things together that are modified at the same time from the same thread.
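For illustration, here's a minimal sketch of the struct/array approach (the 64-byte figure and the PaddedCounter name are assumptions for this example; C++17's std::hardware_destructive_interference_size reports the target's value where available):

    #include <atomic>
    #include <cstddef>

    // Assumed cache-line size; check your target, or use
    // std::hardware_destructive_interference_size (C++17).
    constexpr std::size_t kCacheLine = 64;

    // alignas pads each struct out to a full cache line, so adjacent
    // array elements can't false-share.
    struct alignas(kCacheLine) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[4];  // one per thread, each in its own line

    void work(int tid) {
        counters[tid].value.fetch_add(1, std::memory_order_relaxed);
    }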
Thread-local variables solve a different problem. They let the same function access a different static variable depending on which thread called it. This is an alternative to passing around pointers.
They're still stored in memory just like other static/global variables. You can be sure there's no false sharing, but there are cheaper ways to avoid that.
The difference between thread-local vars and "normal" globals is in how they're addressed. Instead of being accessed through an absolute address, they're accessed at an offset within a per-thread storage block.
On x86, this is done with segment-override prefixes, e.g. mov rax, QWORD PTR fs:0x28 loads from byte 0x28 inside the thread-local storage block (since each thread's fs segment base is set to the address of its own TLS block).
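For example (a sketch with a made-up variable name; the exact addressing depends on the compiler and which TLS model it picks):

    thread_local int tls_counter = 0;  // one instance per thread

    int bump() {
        // On x86-64 Linux, GCC/clang typically compile this access to
        // something like
        //     mov eax, DWORD PTR fs:tls_counter@tpoff
        // i.e. an fs-relative load instead of an absolute address.
        return ++tls_counter;
    }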
So TLS isn't free. Don't use it if you don't need it. It can be cheaper than passing pointers around, though.
There's no way to let the hardware skip cache-coherency checks, because hardware doesn't have any notion of TLS. There are just stores and loads to/from memory, and the ordering guarantees provided by the ISA. Since TLS is just a trick for getting the same function to use different addresses for different callers, a software bug in implementing TLS could result in stores to the same address. The hardware doesn't let buggy software break its cache coherency this way, since that would potentially break privilege separation.
On weakly-ordered architectures, memory_order_consume is (in theory) a way to arrange inter-thread data dependencies so that other threads only have to wait for writes to the shared data, not for writes to thread-private data.
However, this is too hard for compilers to safely and reliably get right, so they currently implement mo_consume as the stronger mo_acquire. I wrote a really long and rambling answer a while ago with a bunch of links to memory-ordering stuff, and a mention of C++11 memory_order_consume.
It's so hard to standardize because different architectures have different rules for what operations carry a dependency. I assume some code-bases use hand-written asm that takes advantage of dependency ordering; AFAIK, that's the only way to exploit it to avoid memory-barrier instructions on a weakly-ordered ISA (e.g. in a producer-consumer model, or in lockless algorithms that need more than just unordered atomic stores).
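To make the consume idea concrete, here's a sketch of pointer-publication in C++ (Payload and published are made-up names; and since current compilers promote the consume load to acquire, portable C++ doesn't actually get the barrier savings):

    #include <atomic>

    struct Payload { int data; };
    std::atomic<Payload*> published{nullptr};

    void producer() {
        Payload* p = new Payload{42};
        // Release store: the write to p->data is ordered before the
        // pointer becomes visible to other threads.
        published.store(p, std::memory_order_release);
    }

    int consumer() {
        Payload* p;
        while (!(p = published.load(std::memory_order_consume))) {}
        // This load carries a data dependency on the pointer load; that
        // dependency is what consume ordering (and hardware dependency
        // ordering on ARM / PowerPC) is supposed to let you rely on
        // without a barrier instruction.
        return p->data;
    }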