
As I understand things, for performance on NUMA systems there are two cases to avoid:

  1. threads in the same socket writing to the same cache line (usually 64 bytes)
  2. threads from different sockets writing to the same virtual page (usually 4096 bytes)

A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel Hyper-Threading or AMD two cores per module). Let me borrow the diagram at OpenMP: for schedule:

| socket 0 | core 0 | thread 0 |
|          | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
|          | core 3 | thread 3 |

So based on case 1 it's best to avoid, e.g., thread 0 and thread 1 writing to the same cache line, and based on case 2 it's best to avoid, e.g., thread 0 writing to the same virtual page as thread 2.
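
To make case 1 concrete, here is a minimal C/OpenMP sketch (my own illustration, not from the linked question; the 64-byte line size and the GCC/Clang alignment attribute are assumptions). Two threads increment adjacent counters that share one cache line, versus padded counters that each get their own line:

```c
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64  /* assumed line size; verify for your CPU */

/* Case 1: both counters sit in the same cache line, so writes from
   thread 0 and thread 1 ping-pong that line between their cores. */
int shared_counters[2];

/* Fix: pad and align each counter so every thread writes its own line.
   __attribute__((aligned)) is a GCC/Clang extension. */
struct padded_counter {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter counters[2];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++) {
            /* shared_counters[id]++;   <- the case-1 version */
            counters[id].value++;       /* each line stays core-local */
        }
    }
    printf("%d %d\n", counters[0].value, counters[1].value);
    return 0;
}
```

With something like OMP_PROC_BIND=close the two threads should land on cores 0 and 1 of socket 0, which is exactly the case-1 scenario above.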

However, I have been informed that on modern processors the second case is no longer a concern: threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).

Is case 2 no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?

  • What's your source for "I have been informed"? – timday Feb 14 '14 at 22:49
  • @timday, it's in the comments here. I don't have a NUMA system. I only know what I read, and from what I have read case 2 still applies, but now I don't know. – Z boson Feb 15 '14 at 07:38
  • Oops... I meant in the comments here: http://stackoverflow.com/questions/21741802/why-would-parallelization-decrease-performance-so-dramatically/21748825#21748825 – Z boson Feb 15 '14 at 09:59

1 Answer


You're right about case 1. Some more details about case 2:

Based on the operating system's NUMA policy and any related page-migration issues, the physical location of the page that threads 0 and 2 are writing to could be socket 0 or socket 1. The cases are symmetrical, so let's say that there's a first-touch policy and that thread 0 gets there first. The sequence of operations could be:

  1. Thread 0 allocates the page. Under the first-touch policy, that first write physically places the page on socket 0 (see the first-touch sketch after this list).
  2. Thread 0 does a write to the cache line it'll be working on. That cache line transitions from invalid to modified within cache(s) on socket 0.
  3. Thread 2 does a write to the cache line it'll be working on. To put that line in exclusive state, socket 1 has to send a Read For Ownership to socket 0 and receive a response.
  4. Threads 0 and 2 can go about their business. As long as thread 0 doesn't touch thread 2's cache line or vice versa and nobody else does anything that would change the state of either line, all operations that thread 0 and thread 2 are doing are socket- (and possibly core-) local.
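
Regarding step 1: with first touch, the page is physically backed on the NUMA node of the first thread that writes it, not the thread that called malloc. A minimal sketch of the usual idiom (assuming Linux's default first-touch behavior and threads pinned to their sockets) is to initialize the data with the same parallel distribution the later compute loop will use:

```c
#include <stdlib.h>
#include <omp.h>

double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);  /* virtual pages, not yet backed */

    /* First touch: each thread writes the chunk it will later compute on,
       so those pages get physically allocated on that thread's node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;

    return a;
}
```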

Returning to the sequence: you could swap the order of steps 2 and 3 without affecting the outcome. Either way, the round trip between sockets in step 3 is going to take longer than the socket-local access in step 2, but that cost is incurred only once each time thread 2 needs to put its line into the modified state. If execution continues for long enough between state transitions of that cache line, the extra cost will amortize.
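
If you want to observe this yourself, here is a rough sketch (my own; the 4 KiB page, the 64-byte line, and the pinning environment are all assumptions). The two threads write to the same virtual page but to different cache lines, so after the initial ownership transfer in step 3 each loop runs at socket-local speed. Run it with something like OMP_PLACES=sockets OMP_PROC_BIND=spread so the two threads land on different sockets (OpenMP thread 1 then plays the role of thread 2 in the diagram):

```c
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64   /* assumed */
#define PAGE_SIZE 4096  /* assumed */

/* One page-aligned page: thread 0 writes offset 0 and the other thread
   writes offset 64 -- same virtual page (case 2), different cache lines
   (so no case-1 false sharing). _Alignas(4096) relies on the compiler
   supporting extended alignment (GCC/Clang do). */
static _Alignas(PAGE_SIZE) char page[PAGE_SIZE];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        long *p = (long *)(page + omp_get_thread_num() * CACHE_LINE);
        for (long i = 0; i < 100000000L; i++)
            (*p)++;  /* after the first RFO, the line stays modified locally */
    }
    printf("%ld %ld\n", *(long *)page, *(long *)(page + CACHE_LINE));
    return 0;
}
```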
