
My target platforms are Windows and Linux on x86-64 (Coffee Lake or higher, Zen 2 or higher) and the Mac M2. I'm wondering: is there a penalty when multiple threads access the same data at the same time, and how large is the penalty if one thread changes one variable once? Do other cores stall immediately if they have that cache line loaded? How many cycles does the update take to propagate? From my understanding, false sharing happens when you change a byte on a line other threads are using; is the affected region strictly 128 bytes or less? Do I have to worry about the TLB at all?
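For reference, if padding does turn out to be necessary, I'd do something like this (a minimal sketch; `PaddedFlag` is a made-up name, and 128 is an assumption based on the adjacent-line prefetcher pairing cache lines on these Intel cores):

```cpp
#include <atomic>
#include <cstdint>

// Pad each shared flag out to an aligned 128-byte region so two flags
// can never land on the same aligned pair of cache lines.
struct alignas(128) PaddedFlag {
    std::atomic<std::uint32_t> flag{0};
    // the rest of the struct is padding inserted by alignas
};
static_assert(sizeof(PaddedFlag) == 128, "flags stay on separate line pairs");
```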

Here's my situation: I have a few objects that are the source of truth for some data. I can't remember whether they're 64 bytes or larger. There's a 32-bit status flag in the first 64 bytes. Many threads may access it and the bytes next to it, sometimes 100 times, sometimes just once from one other thread. I'm not sure how many nanoseconds will pass between the write and a read, but only one write will happen.
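The layout is roughly this (a sketch; the field names are placeholders, not the real members):

```cpp
#include <cstdint>

// Approximate layout of one source-of-truth object. The 32-bit status
// flag shares the first 64 bytes with data the other threads read.
struct SourceOfTruth {
    std::uint32_t status;      // one thread sets a bit here, once
    std::uint32_t neighbor;    // hot fields the readers actually want...
    std::uint8_t  payload[56]; // ...filling out the first 64 bytes
};
```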

The C++ thread sanitizer complained that I'm changing the flag in one thread and reading it in another, with neither using atomic operations. The other threads don't need to see the update, since I simply set a bit they don't care about. I was thinking I could use atomic load/store with `memory_order_relaxed`.
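Something like this is what I had in mind (a minimal sketch; `status` stands in for the real flag, and there is only ever the one writing thread):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> status{0};

// The single writer sets one bit. Relaxed operations make the race
// defined behavior (and satisfy TSan) without adding any fences; on
// x86-64 and the M2 they compile to plain loads and stores.
void set_bit(std::uint32_t bit) {
    status.store(status.load(std::memory_order_relaxed) | bit,
                 std::memory_order_relaxed);
}

// Readers don't care about the new bit, so seeing a stale value is fine.
std::uint32_t read_status() {
    return status.load(std::memory_order_relaxed);
}
```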

Another option is keeping a pointer and going through it to update the data. I was also wondering: if 1K objects are each written once and every other thread happens to read them within 10 ns, would that be a problem? How many cycles would the stall be? This assumes there's no penalty when many cores are merely reading the same data. I already have a bit of a memory bandwidth problem (I'm writing a lot of data), so I'm concerned about touching more memory than I need to.
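For the pointer option I'd picture something like this (again a sketch with made-up names; the release/acquire pairing is my assumption, since readers that follow the new pointer do need to see the object behind it, unlike with the flag bit):

```cpp
#include <atomic>

struct Data { /* the real fields */ int payload; };

std::atomic<Data*> current{nullptr};

// The one-time update publishes a freshly written object instead of
// mutating shared bytes in place.
void publish(Data* fresh) {
    // release: a reader that sees the new pointer also sees the
    // object's contents (relaxed would not be enough here)
    current.store(fresh, std::memory_order_release);
}

Data* snapshot() {
    // acquire pairs with the release store above
    return current.load(std::memory_order_acquire);
}
```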

Stan
  • A performance experiment on Skylake (same microarchitecture as Coffee Lake) - [Should the cache padding size of x86-64 be 128 bytes?](https://stackoverflow.com/q/72126606) - includes `perf stat` results for the `machine_clears.memory_ordering` event with two threads incrementing nearby counters. (With pure stores or atomic RMWs, you'd just get cache misses, not pipeline nukes, but plain loads can execute early, and on x86 that has to be done speculatively to maintain the strong memory ordering.) My answer there also explains some details that should make the concept clearer. – Peter Cordes Feb 09 '23 at 05:28
  • Yes, an aligned 128-byte pair of cache lines is the largest size you have to worry about for false sharing. – Peter Cordes Feb 09 '23 at 05:30
  • *C++ thread sanitizer complained that I'm changing the flag in one thread and reading in another, neither using atomic operations.* - That's not **false** sharing, that's true sharing. Inter-thread latency is often on the order of 40 ns; more for a many-core CPU, depending on how it clusters groups of cores together and how far away the other core is. See https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html, and https://www.anandtech.com/show/16315/the-ampere-altra-review/3 has plots of an Ampere Altra vs. a 2-socket Intel Xeon Platinum 8280 (latency is higher between sockets), and an Epyc. – Peter Cordes Feb 09 '23 at 06:37
  • To me the description doesn't sound like false sharing (see https://en.wikipedia.org/wiki/False_sharing). – Support Ukraine Feb 09 '23 at 07:07

0 Answers