
As I understand things, for performance on NUMA systems there are two cases to avoid:

  1. threads in the same socket writing to the same cache line (usually 64 bytes)
  2. threads from different sockets writing to the same virtual page (usually 4096 bytes)

A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel Hyper-Threading or AMD two cores per module). Let me borrow the diagram at OpenMP: for schedule:

| socket 0 | core 0 | thread 0 |
|          | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
|          | core 3 | thread 3 |

So based on case 1 it's best to avoid, e.g., thread 0 and thread 1 writing to the same cache line, and based on case 2 it's best to avoid, e.g., thread 0 writing to the same virtual page as thread 2.
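
To make case 1 concrete, here is a minimal C/OpenMP sketch (my own illustration, not from the linked question; the 64-byte line size and the GCC/Clang alignment attribute are assumptions). Two threads increment adjacent counters that share one cache line, versus padded counters that each get their own line:

```c
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64  /* assumed line size; verify for your CPU */

/* Case 1: both counters sit in the same cache line, so writes from
   thread 0 and thread 1 ping-pong that line between their cores. */
int shared_counters[2];

/* Fix: pad and align each counter so every thread writes its own line.
   __attribute__((aligned)) is a GCC/Clang extension. */
struct padded_counter {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter counters[2];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; i++) {
            /* shared_counters[id]++;   <- the case-1 version */
            counters[id].value++;       /* each line stays core-local */
        }
    }
    printf("%d %d\n", counters[0].value, counters[1].value);
    return 0;
}
```

With something like OMP_PROC_BIND=close the two threads should land on cores 0 and 1 of socket 0, which is exactly the case-1 scenario above.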

However, I have been informed that on modern processors the second case is no longer a concern: threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).

Is case 2 no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?

  • What's your source for "I have been informed"? – timday Feb 14 '14 at 22:49
  • @timday, it's in the comments here. I don't have a NUMA system. I only know what I read, and from what I have read case 2 still applies, but now I don't know. – Z boson Feb 15 '14 at 07:38
  • Oops... I meant in the comments here: http://stackoverflow.com/questions/21741802/why-would-parallelization-decrease-performance-so-dramatically/21748825#21748825 – Z boson Feb 15 '14 at 09:59

1 Answer


You're right about case 1. Some more details about case 2:

Based on the operating system's NUMA policy and any related page-migration issues, the physical location of the page that threads 0 and 2 are writing to could be socket 0 or socket 1. The cases are symmetrical, so let's say that there's a first-touch policy and that thread 0 gets there first. The sequence of operations could be:

  1. Thread 0 allocates the page. Under the first-touch policy, that first write physically places the page on socket 0 (see the first-touch sketch after this list).
  2. Thread 0 does a write to the cache line it'll be working on. That cache line transitions from invalid to modified within cache(s) on socket 0.
  3. Thread 2 does a write to the cache line it'll be working on. To put that line in exclusive state, socket 1 has to send a Read For Ownership to socket 0 and receive a response.
  4. Threads 0 and 2 can go about their business. As long as thread 0 doesn't touch thread 2's cache line or vice versa and nobody else does anything that would change the state of either line, all operations that thread 0 and thread 2 are doing are socket- (and possibly core-) local.
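
Regarding step 1: with first touch, the page is physically backed on the NUMA node of the first thread that writes it, not the thread that called malloc. A minimal sketch of the usual idiom (assuming Linux's default first-touch behavior and threads pinned to their sockets) is to initialize the data with the same parallel distribution the later compute loop will use:

```c
#include <stdlib.h>
#include <omp.h>

double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);  /* virtual pages, not yet backed */

    /* First touch: each thread writes the chunk it will later compute on,
       so those pages get physically allocated on that thread's node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;

    return a;
}
```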

Returning to the sequence: you could swap the order of steps 2 and 3 without affecting the outcome. Either way, the round trip between sockets in step 3 is going to take longer than the socket-local access in step 2, but that cost is incurred only once each time thread 2 needs to put its line into the modified state. If execution continues for long enough between state transitions of that cache line, the extra cost will amortize.
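
If you want to observe this yourself, here is a rough sketch (my own; the 4 KiB page, the 64-byte line, and the pinning environment are all assumptions). The two threads write to the same virtual page but to different cache lines, so after the initial ownership transfer in step 3 each loop runs at socket-local speed. Run it with something like OMP_PLACES=sockets OMP_PROC_BIND=spread so the two threads land on different sockets (OpenMP thread 1 then plays the role of thread 2 in the diagram):

```c
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64   /* assumed */
#define PAGE_SIZE 4096  /* assumed */

/* One page-aligned page: thread 0 writes offset 0 and the other thread
   writes offset 64 -- same virtual page (case 2), different cache lines
   (so no case-1 false sharing). _Alignas(4096) relies on the compiler
   supporting extended alignment (GCC/Clang do). */
static _Alignas(PAGE_SIZE) char page[PAGE_SIZE];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        long *p = (long *)(page + omp_get_thread_num() * CACHE_LINE);
        for (long i = 0; i < 100000000L; i++)
            (*p)++;  /* after the first RFO, the line stays modified locally */
    }
    printf("%ld %ld\n", *(long *)page, *(long *)(page + CACHE_LINE));
    return 0;
}
```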
