Maximum cache misses possible from using Thread Local Variables

Question

Referring to this already asked and answered question (How are the fs/gs registers used in Linux AMD64?) and to this document, referenced in one of its answers (https://akkadia.org/drepper/tls.pdf):

According to the document, the FS register points to the TCB (thread control block), which points to the DTV (dynamic thread vector), which ultimately leads to the thread-local data. Is it then right to assume that loading a thread-local variable can incur up to 3 cache misses (1 for the TCB, 1 for the DTV, and 1 for the data itself)?

score 3 · Accepted Answer · answered Nov 15 '18 at 16:36

According to Godbolt, the following code:

thread_local int t;

int get_t () {
    return t;
}

Generates the following x86-64 assembly:

mov     eax, DWORD PTR fs:t@tpoff
ret

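So I make that one memory access: the fs segment base plus a link-time constant offset (t@tpoff), with no separate loads from the TCB or the DTV. And there is in fact an answer in the post you link to that says the same thing.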

Paul Sanders

Yup, exactly. The linker resolves the symbol to an offset added to the segment base, with no extra levels of indirection. (Well, the fs base is kind of an extra level of indirection, but it's held inside the CPU in the descriptor cache, not loaded from memory on every use. On Intel CPUs, using a non-zero segment base adds 1 cycle of load latency. So that and a bit of extra code size are the only costs of TLS.) – Peter Cordes Nov 15 '18 at 16:46
