
While doing some research about lock-free/wait-free algorithms, I stumbled upon the false sharing problem. Digging a bit more led me to Folly's source code (Facebook's C++ library) and more specifically to this header file and the definition of the FOLLY_ALIGN_TO_AVOID_FALSE_SHARING macro (currently at line 130). What surprised me most at first glance was the value: 128, instead of the 64 I expected...

/// An attribute that will cause a variable or field to be aligned so that
/// it doesn't have false sharing with anything at a smaller memory address.
#define FOLLY_ALIGN_TO_AVOID_FALSE_SHARING __attribute__((__aligned__(128)))

AFAIK, cache lines on modern CPUs are 64 bytes long, and every resource I've found so far on the matter, including this article from Intel, talks about 64-byte alignment and padding to help work around false sharing.

Still, the folks at Facebook align and pad their class members to 128 bytes when needed. Then I found the beginning of an explanation just above FOLLY_ALIGN_TO_AVOID_FALSE_SHARING's definition:

enum {
    /// Memory locations on the same cache line are subject to false
    /// sharing, which is very bad for performance.  Microbenchmarks
    /// indicate that pairs of cache lines also see interference under
    /// heavy use of atomic operations (observed for atomic increment on
    /// Sandy Bridge).  See FOLLY_ALIGN_TO_AVOID_FALSE_SHARING
    kFalseSharingRange = 128
};

While it gives me a bit more detail, I still feel I need some insight. I'm curious how the synchronization of contiguous cache lines, or any RMW operations on them, could interfere with one another under heavy use of atomic operations. Can someone please enlighten me on how this can even happen?
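
For reference, here is the kind of microbenchmark I have in mind (a minimal sketch of my own, not Folly's actual benchmark): two threads each hammer their own relaxed atomic counter, and only the distance between the two counters changes. If pairs of cache lines really do interfere, the counters placed 64 bytes apart should behave measurably worse than the ones placed 128 bytes apart.

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>

// Two counters separated by 'Pad' bytes of padding. 'a' starts a
// 128-byte-aligned block, so Pad controls whether 'b' shares a's 64-byte
// line, the adjacent line of the same 128-byte pair, or neither.
template <std::size_t Pad>
struct Counters {
    alignas(128) std::atomic<long> a{0};
    char pad[Pad];
    std::atomic<long> b{0};
};

template <std::size_t Pad>
double run(long iters) {
    Counters<Pad> c;
    auto worker = [iters](std::atomic<long>& x) {
        for (long i = 0; i < iters; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(worker, std::ref(c.a));
    std::thread t2(worker, std::ref(c.b));
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {  // build with e.g. g++ -O2 -std=c++11 -pthread
    const long iters = 100000000;
    std::printf("b at offset  16 (same 64B line)  : %.3f s\n", run<8>(iters));
    std::printf("b at offset  64 (adjacent line)  : %.3f s\n", run<56>(iters));
    std::printf("b at offset 128 (separate block) : %.3f s\n", run<120>(iters));
}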

polyvertex

3 Answers


As Hans pointed out in a comment, some info about this can be found in "Intel® 64 and IA-32 architectures optimization reference manual", in section 3.7.3 "Hardware Prefetching for Second-Level Cache", about the Intel Core microarchitecture:

"Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line."

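To put it concretely (my own illustration, not from the manual): with only 64-byte alignment, two hot variables can still land in the two halves of the same 128-byte-aligned block, which is exactly the unit this prefetcher operates on, whereas 128-byte alignment gives each variable a block of its own.

#include <atomic>

// With 64-byte alignment, 'a' and 'b' may occupy the two halves of a single
// 128-byte block, so a prefetch triggered by one of them drags in the other's line.
struct Padded64 {
    alignas(64) std::atomic<long> a;
    alignas(64) std::atomic<long> b;
};
static_assert(sizeof(Padded64) == 128, "both counters can fit in one 128-byte block");

// With 128-byte alignment, each counter is the sole occupant of its block,
// so the pair-line prefetch cannot couple them.
struct Padded128 {
    alignas(128) std::atomic<long> a;
    alignas(128) std::atomic<long> b;
};
static_assert(sizeof(Padded128) == 256, "each counter owns a whole 128-byte block");
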
Elias
  • In modern microarchitectures like Sandybridge-family, it's actually the L2 spatial prefetcher that likes to complete aligned pairs of cache lines; the streamer is separate. (Intel Core is ancient, like Core2Duo Conroe and Penryn from ~2007.) – Peter Cordes Nov 08 '21 at 08:34
  • [bytes aligned and false sharing cause performance diff on x86-64](https://stackoverflow.com/q/69879789) has benchmarks on some unknown recent x86 microarchitecture. See also [Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size](https://stackoverflow.com/q/39680206) for more details about why 128 makes sense on modern x86. – Peter Cordes Nov 08 '21 at 08:35

It seems that, while Intel uses 64-byte cache lines, there are various other architectures that use 128-byte cache lines... for example:

http://goo.gl/8L6cUl

Power Systems use 128-byte length cache lines. Compared to Intel processors (64-byte cache lines), these larger cache lines have...

I've found notes scattered around the internet saying that other architectures, even old ones, do the same:

http://goo.gl/iNAZlX

SGI MIPS R10000 Processor in the Origin Computer

The processor has a cache line size of 128 bytes.

So probably the Facebook programmers wanted to play it safe and didn't want a big collection of #define/#if based on processor architecture, with the risk that some newer Intel processor would have a 128-byte cache line and no one would remember to correct the code.
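
For illustration, the per-architecture alternative might look something like this (a hypothetical sketch, not Folly code; even picking the right value per architecture is debatable, which is part of the problem):

// Hypothetical per-architecture constant; every new target, or a CPU that
// changes its effective interference range, means another branch to maintain.
#if defined(__powerpc64__)
#  define MY_FALSE_SHARING_RANGE 128   // POWER: 128-byte cache lines
#elif defined(__x86_64__) || defined(__i386__)
#  define MY_FALSE_SHARING_RANGE 64    // x86: 64-byte lines, but see the Sandy Bridge note above
#else
#  define MY_FALSE_SHARING_RANGE 128   // conservative default
#endif

#define MY_ALIGN_TO_AVOID_FALSE_SHARING \
    __attribute__((__aligned__(MY_FALSE_SHARING_RANGE)))

A single conservative 128 sidesteps all of that bookkeeping at the cost of a little extra padding.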

xanatos
  • Still, `pair of cache line also see interference ... observed for atomic increment on Sandy Bridge` ... It really seems they updated the `FOLLY_ALIGN_TO_AVOID_FALSE_SHARING` value because of a series of tests on **that** particular architecture. i.e.: 64 bytes alignment and padding wasn't enough according to these results. – polyvertex Mar 23 '15 at 07:35

Whether you use atomic operations or not, the cache has a "cache line", which is the smallest unit the cache operates on. This ranges from 32 to 128 bytes, depending on the processor model. False sharing is when elements within the same cache line are "shared" between different threads (that run on different processors[1]). When this happens, one processor updating "its value" forces all the other processors to get rid of their copy of that data.

It gets worse with atomic operations, because to perform any atomic operation the processor doing it must first ensure that all other processors have discarded their copies, so that no other processor keeps using an "old" value before the update. This requires a lot of cache-maintenance messages to be propagated through the system, and forces processors to re-load values they previously had in the cache.

So, from a performance perspective, if you have variables that are each used by a single thread, separate them out onto their own cache line (in the example in the original post, this is assumed to be 128 bytes) by aligning the data to that value. That way each lump of data starts on an even cache-line boundary, and no other processor will "share" the same data (unless you are genuinely sharing the data between threads, at which point you HAVE to do the relevant cache maintenance to ensure the data is correctly updated between the processors).

[1] Or processor cores in modern CPUs with multiple cores. For simplicity, I've used the term "processor" or "processors" to mean either real processor sockets or processor cores within one socket. For this discussion, the distinction is pretty much irrelevant.
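
As a minimal sketch of the "separate them out" advice above: C++17 later standardized a name for this value in <new> (std::hardware_destructive_interference_size). Note that it needs a recent toolchain (GCC 12+, for example), the value is implementation-defined, and on x86-64 it is often reported as 64 even though the pair-line behaviour discussed elsewhere on this page argues for 128, which is why a hard-coded conservative constant like Folly's 128 is still a common choice.

#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kNoFalseSharing = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kNoFalseSharing = 128;  // conservative fallback
#endif

// Each hot counter gets its own line (or pair of lines), so writes by one
// thread do not keep invalidating the copy cached by the other thread.
struct PerThreadStats {
    alignas(kNoFalseSharing) std::atomic<long> produced{0};  // written by thread A
    alignas(kNoFalseSharing) std::atomic<long> consumed{0};  // written by thread B
};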

Mats Petersson
  • I understand the concept of false-sharing. My question was about the value they use to do the alignment and padding (128 bytes) vs. the architecture that _forced_ them to choose this value (i.e.: Sandy Bridge, which has 64-byte cache lines AFAIK). – polyvertex Mar 25 '15 at 16:10
  • I thought some of the older Intel architectures also used 128-byte cache-lines. Maybe they haven't ONLY got processors produced in the last few years? – Mats Petersson Mar 25 '15 at 20:46
  • You've made a thorough answer and I thank you for that. But please read my admittedly-too-long-and-probably-unclear question: `Microbenchmarks indicate that pairs of cache lines also see interference ... (observed ... on Sandy Bridge).` There's not much room for doubt concerning the architecture here. They've changed the alignment to 128 bytes because of bad results with **Sandy Bridge**. What I would like to understand is why and how a larger padding would reduce false-sharing in that particular case. @Hans' comment to the question might be of interest for you as it has been for me. – polyvertex Mar 25 '15 at 21:25
  • Unfortunately, I'm not particularly familiar with the latest Intel processors. When I was doing benchmarking at AMD some 12-15 years back, I was quite in tune with what the differences between models were. Not so much these days, as my home machines all have AMD processors (still loyal to the brand where I worked now about 10 years ago) and only using ARM processors at work for the last 8 or so years. [Well, my work machine itself has had an Intel processor of some variant or other, but I've not really paid much attention to what or how it's behaving] – Mats Petersson Mar 25 '15 at 21:50