8

Context of the problem:

I am writing code that creates 32 threads and sets each one's affinity to a different core of the 32 cores in my multi-core, multi-processor system.

Each thread simply executes the RDTSCP instruction and stores the value in a shared array at a non-overlapping position. This is the shared array:

uint64_t rdtscp_values[32];

So, every thread writes to a specific array position based on its core number.
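
Roughly, each thread does something like this (a simplified sketch using pthreads and GCC's __rdtscp intrinsic; the affinity setup and thread creation are omitted):

#include <pthread.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

uint64_t rdtscp_values[32];

static void *worker(void *arg) {
    unsigned core = (unsigned)(uintptr_t)arg;   /* core this thread is pinned to */
    unsigned aux;
    rdtscp_values[core] = __rdtscp(&aux);       /* write the timestamp into "my" slot */
    return NULL;
}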

Up to now, everything is working properly, except that I know I may not be using the right data structure to avoid cache line bouncing.

P.S: I have checked already that my processor's cache line is 64-bytes wide.

Because I am using a simple uint64_t array, a single cache line will hold 8 positions of this array (8 bytes each, so 64 / 8 = 8 elements per line).

Question:

Because of this simple array, although the threads write to different indexes, my understanding is that every write to this array will invalidate the cache line for all the other threads sharing it. Is that correct?

How could I create a structure that is aligned to the cache line?

EDIT 1

My system is: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)

campescassiano
  • If any Intel expert has free time and is on Stack Overflow, maybe you will have your answer one day. – Stargateur Oct 24 '17 at 12:19
  • 2
    As I understand it, not all other threads, but "only" 8 other threads. You could try to declare the array as `uint64_t rdtscp_values[32*8];` and access `rdtscp_values[i*8]` instead of `rdtscp_values[i]` to avoid cache line bouncing. Just try it out and make a performance test to see if it improves overall results. Tools like `perf` may help you with the evaluation. – Ctx Oct 24 '17 at 12:46
  • 3
    Hint: check __attribute__ ((aligned (64))) (in case of gcc) and alignas(64) in case of C11. – Anty Oct 24 '17 at 12:49
  • The cache systems of different CPUs work differently, so you should probably post info about the exact CPUs you have. Besides that, the first thing to do is to make sure that the data accessed by the individual threads maps to different cache lines – Support Ukraine Oct 24 '17 at 12:50
  • Addendum: I assume that you do not have 32 processors, but n processors with m cores each. If you have, for example, 4 processors with 8 cores (making 32 cores in total), it should be sufficient to make sure that threads 0-7 are on processor 0, 8-15 on processor 1, 16-23 on processor 2 and 24-31 on processor 3. Then you should have the optimal distribution with no cache line bouncing with your original array. Just make sure to align it on a 64-byte boundary (with `uint64_t *rdtscp_values = memalign(64, 256);` for example). – Ctx Oct 24 '17 at 12:50
  • 3
    There is a similar question [here](https://stackoverflow.com/questions/46173972/cache-friendly-way-to-collect-results-from-multiple-threads). Tagging [x86] will greatly help you get an answer. – Margaret Bloom Oct 24 '17 at 13:53
  • To begin with, why are you concerned with cache use of this single array? It is obviously just an array where you dump thread results/status and not actual data. It will be far more important to ensure that the data each thread is working with is aligned according to your data cache line size. But of course this is all an exercise in premature optimization - ideally the OS should assign threads to a certain core, not the application itself. Unless you are writing an OS scheduler, that is. – Lundin Oct 24 '17 at 14:47
  • @Lundin I am working on some optimizations, and for that reason I need the threads to run as light as possible. So one of my concerns is avoiding these cache invalidations, which cause unnecessary overhead. – campescassiano Oct 24 '17 at 16:07
  • @Ccampes: Lundin’s point is that writing to this array is likely a very small portion of the work the threads do. Usually when we see something like this, the threads are doing lots of computations and occasionally writing summary results to the array. Even if cache line invalidations for the array occur, that is likely a tiny portion of your execution time, and optimizing it will not produce a noticeable effect. The additional reload time might even vanish because it will overlap with other work the threads do. – Eric Postpischil Oct 24 '17 at 16:25
  • 1
    That said, if you really want the threads not to interfere with each other, a simple solution is to declare the array as `uint64_t rdtscp_values[32][8]` and have each thread write to `rdtscp_values[index][0]`. This spreads out the used elements so they are 64 bytes apart, each in its own cache line. Likely a waste of memory, but you could try it to see if there is any effect. Of course, then you are using more cache lines, so the threads may interfere with each other by causing evictions due to full cache sets rather than due to taking ownership of shared lines. So this could actually hurt. – Eric Postpischil Oct 24 '17 at 16:28
  • 1
    *Why* are you storing the results of `rdtscp` in an array repeatedly? Presumably, so another thread can read in between, right? Otherwise, you'd either: a) only use `rdtscp` once or b) keep the results thread-local on the stacks. This kind of information should be in the question. – EOF Oct 24 '17 at 16:56
  • @EOF RDTSCP is just to get "random" values; that part doesn't actually matter. The concern here is the cache-line invalidations. – campescassiano Oct 24 '17 at 17:26
  • Thank you all. I edited the post with the answer based on the comments and suggestions you guys posted here. Thank you again. – campescassiano Oct 24 '17 at 17:32
  • 1
    @Ccampes Instead of adding edit 2 to the question, you should post it as an answer to the question. Answering your own questions is perfectly fine on SO. – Lundin Oct 25 '17 at 08:40

2 Answers

6

Yes you definitely want to avoid "false sharing" and cache-line ping-pong. But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you're not wasting cache footprint on 56 bytes of padding. See also Cache-friendly way to collect results from multiple threads. (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
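
For example, something like this (a sketch; the second field is just a placeholder for whatever per-thread state you actually have):

#include <stdint.h>

/* keep the result slot together with other per-thread state
   instead of burning 56 bytes of dead padding */
struct per_thread_state {
    uint64_t rdtscp_value;   /* the slot other threads occasionally collect */
    uint64_t local_work[7];  /* thread-private data that fills the rest of the line */
};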


But let's just assume for a minute that unused padding between slots for different threads is actually what you want.

Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally-aligned (so they aren't split across a cache-line boundary).

It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy so we might as well.


Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.

With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.

#include <stdalign.h>   // for #define alignas _Alignas  for C++ compat
#include <stdint.h>     // for uint64_t

// compiler knows the padding is just padding
struct { alignas(64) uint64_t v; } rdtscp_values[32];

int foo(unsigned t) {
    rdtscp_values[t].v = 1;
    return sizeof(rdtscp_values[0]);  // yes, this is 64
}

Or with an array as suggested by @Eric Postpischil:

alignas(64) // optional, stride will still be 64B without this.
uint64_t rdtscp_values_2d[32][8];  // 8 uint64_t per cache line

void bar(unsigned t) {
    rdtscp_values_2d[t][0] = 1;
}

alignas() is optional if you don't care about the whole thing being 64B aligned, just having 64B stride between elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
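
For example, one way to write such a macro (a sketch; adjust the compiler checks to taste):

#include <stdint.h>

#if defined(_MSC_VER)
  #define ALIGN(n) __declspec(align(n))
#elif defined(__GNUC__)
  #define ALIGN(n) __attribute__((aligned(n)))
#else
  #define ALIGN(n) _Alignas(n)    /* C11 fallback */
#endif

ALIGN(64) uint64_t rdtscp_values_2d[32][8];   /* 64B-aligned, 64B stride */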


Either way produces the same asm. We can check compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:

foo:   # and same for bar
    mov     eax, edi              # zero extend 32-bit to 64-bit
    shl     rax, 6                # *64 is the same as <<6
    mov     qword ptr [rax + rdtscp_values], 1    # store 1

    mov     eax, 64               # return value = 64 = sizeof(struct)
    ret

Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:

    .comm   rdtscp_values_2d,2048,64
    .comm   rdtscp_values,2048,64

Dynamic storage:

If the number of threads is not a compile-time constant, then you can use an aligned allocation function to get aligned dynamically-allocated memory (especially if you want to support a very high number of threads). See How to solve the 32-byte-alignment issue for AVX load/store operations?, but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().

#include <stdlib.h>     // aligned_alloc (C11)

struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;
void init(unsigned nthreads) {
    size_t sz = sizeof(dynamic_rdtscp_values[0]);   // 64 bytes
    // aligned_alloc takes (alignment, size); the size must be a multiple of the alignment
    dynamic_rdtscp_values = aligned_alloc(sz, nthreads * sz);
}

void baz(unsigned t) {
    dynamic_rdtscp_values[t].v = 1;
}


 baz:
    mov     rax, qword ptr [rip + dynamic_rdtscp_values]

    mov     ecx, edi            # same code as before to scale by 64 bytes
    shl     rcx, 6
    mov     qword ptr [rax + rcx], 1
    ret

The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core and reloading it when needed is very cheap.


Footnote: In the i386 System V ABI, uint64_t only has 4B-alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't do any alignment of the whole struct, it would be possible to get cache-line splits. But compilers align it by 8B whenever possible, so your struct-with padding is still fine.
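
A minimal illustration of that pitfall (a sketch, assuming the i386 SysV layout rules just described):

#include <stdalign.h>
#include <stdint.h>

struct risky { int a; uint64_t t; };             /* on i386 SysV, t may be only 4B-aligned */
struct safe  { int a; alignas(8) uint64_t t; };  /* natural alignment: t can't straddle a cache line */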

Peter Cordes
2

SOLUTION

So, I followed the comments here, and I must say thanks for all the contributions.

Finally I got what I expected: cache lines being used properly, one per thread.

Here is the shared structure:

#include <stdint.h>

typedef struct align_st {
    uint64_t v;
    uint64_t padding[7];   /* fill the rest of the 64-byte cache line */
} align_st_t __attribute__ ((aligned (64)));

The uint64_t padding[7] member fills the remaining bytes of the cache line when this structure is loaded into L1. On top of that, I ask the compiler for 64-byte alignment with __attribute__ ((aligned (64))).

Then I allocate this structure dynamically based on the number of cores, using memalign():

align_st_t *al = (align_st_t*) memalign(64, n_cores * sizeof(align_st_t));
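
An equivalent allocation with C11 aligned_alloc (a sketch, following the deprecation note in the comments below; it needs <stdlib.h>):

/* size must be a multiple of the alignment; n_cores * 64 is */
align_st_t *al = aligned_alloc(64, n_cores * sizeof(align_st_t));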

To compare, I wrote one version of the code (V1) that uses this aligned structure, and another version (V2) that uses the plain array.

Running both under perf, I got these numbers:

  • V1: 7,184 cache-misses;
  • V2: 2,621,347 cache-misses.

P.S.: Each thread writes 1,000 times to its slot in the shared structure just to inflate the numbers.

campescassiano
  • I also added a method that is sometimes used to avoid the `attribute` instrumentation; I have seen it implemented in a few places. – alinsoar Oct 25 '17 at 14:53
  • 1
    `struct { alignas(64) uint64_t v; };` should work in C11 (with `<stdalign.h>`) or with C++11. (You can make the padding explicit if you want to use it for anything, otherwise leave it out so the compiler knows it doesn't have to copy it.) – Peter Cordes Oct 25 '17 at 22:54
  • 2
    [`memalign` is deprecated](https://linux.die.net/man/3/memalign). Use `aligned_alloc` if you want dynamic storage. – Peter Cordes Oct 26 '17 at 03:38