As I understand things, for performance on NUMA systems, there are two cases to avoid:
- threads in the same socket writing to the same cache line (usually 64 bytes)
- threads from different sockets writing to the same virtual page (usually 4096 bytes)
A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel hyper-threading or AMD two-cores-per-module). Let me borrow the diagram from OpenMP: for schedule
```
| socket 0 | core 0 | thread 0 |
|          | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
|          | core 3 | thread 3 |
```
So, based on case 1, it's best to avoid e.g. thread 0 and thread 1 writing to the same cache line, and based on case 2, it's best to avoid e.g. thread 0 writing to the same virtual page as thread 2.
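To make the two patterns concrete, here is a minimal sketch (not a careful benchmark) of the three data layouts I have in mind. It assumes GCC/Clang (`__attribute__((aligned))`), 64-byte lines, 4 KiB pages, and that the four threads are pinned as in the diagram above (e.g. `OMP_PLACES=cores OMP_PROC_BIND=close`); all names are my own:

```c
#include <omp.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000000L

/* Case 1 (and 2): all four counters fit in one 64-byte cache line. */
static volatile uint64_t shared_line[NTHREADS] __attribute__((aligned(64)));

/* Case 2 only: one cache line per counter, but all four counters sit in
   the same 4 KiB page, so both sockets write to that page. */
static struct { volatile uint64_t v; char pad[64 - sizeof(uint64_t)]; }
        shared_page[NTHREADS] __attribute__((aligned(4096)));

/* Neither case: one full page per counter. */
static struct { volatile uint64_t v; char pad[4096 - sizeof(uint64_t)]; }
        padded[NTHREADS] __attribute__((aligned(4096)));

int main(void)
{
    double t;

    t = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; ++i)
            shared_line[tid]++;      /* case 1: same cache line          */
    }
    printf("same cache line: %.3f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; ++i)
            shared_page[tid].v++;    /* case 2: same page, own line      */
    }
    printf("same page:       %.3f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < ITERS; ++i)
            padded[tid].v++;         /* neither: one page per thread     */
    }
    printf("one page each:   %.3f s\n", omp_get_wtime() - t);
    return 0;
}
```

(Compiled with something like `gcc -O2 -fopenmp`.) I would expect the first loop to be clearly slower than the other two; whether the second and third differ at all is exactly what my question is about.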
However, I have been informed that on modern processors the second case is no longer a concern: threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).
Is case 2 no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?