
Consider the following example that demonstrates false sharing:

using type = std::atomic<std::int64_t>;

struct alignas(128) shared_t
{
  type  a;               // a and b share one cache line
  type  b;
} sh;

struct not_shared_t
{
  alignas(128) type a;   // a and b live on separate cache lines
  alignas(128) type b;
} not_sh;

One thread increments a in steps of 1; another thread increments b. With MSVC, each increment compiles to lock xadd, even though the result is unused.
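As a reference for what that looks like per iteration, here is a rough sketch of the hot loop (hypothetical, reconstructed from the instructions named in this post; real MSVC output differs in register choice and labels):

// Hypothetical per-iteration codegen for the atomic case (rcx = &counter):
//   $loop:
//     mov       eax, 1
//     lock xadd QWORD PTR [rcx], rax   ; atomic RMW; needs the line in E/M state
//     movzx     eax, BYTE PTR stop     ; atomic load of the stop flag
//     test      al, al
//     je        SHORT $loop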

For the structure where a and b are separated, the values accumulated in a few seconds are about seven times greater for not_shared_t than for shared_t (see the table below).

So far, the expected result: the separate cache lines stay hot in L1d cache, the increment bottlenecks on lock xadd throughput, and false sharing is a performance disaster that ping-pongs the cache line between cores. (Editor's note: later MSVC versions use lock inc when optimization is enabled; this may widen the gap between the contended and uncontended cases.)


Now I replace using type = std::atomic<std::int64_t>; with plain std::int64_t.

(The non-atomic increment compiles to inc QWORD PTR [rcx]. The atomic load of the stop flag in the loop condition happens to stop the compiler from just keeping the counter in a register until loop exit.)
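To illustrate what the compiler would otherwise be allowed to do, here is a hypothetical transformation (the plain-bool flag stop_plain and the function name are invented for this sketch; this is not actual MSVC output):

bool stop_plain;                   // hypothetical non-atomic flag

// Without any atomic access in the loop, the optimizer may hoist the
// counter (and even the flag load) into registers, so the memory traffic
// that causes false sharing disappears from the loop entirely.
void thd_hoisted(std::int64_t* var)
{
  std::int64_t tmp = *var;         // load once
  while (!stop_plain) ++tmp;       // counter stays in a register
  *var = tmp;                      // store once at loop exit
}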

The count reached for not_shared_t is still greater than for shared_t, but now by less than a factor of two.

|           type            |    struct    |   final a    |   final b    |
|---------------------------|--------------|--------------|--------------|
| std::atomic<std::int64_t> | shared_t     |    59’052’951|    59’052’951|
| std::atomic<std::int64_t> | not_shared_t |   417’814’523|   416’544’755|
| std::int64_t              | shared_t     |   949’827’195|   917’110’420|
| std::int64_t              | not_shared_t |1’440’054’733 |1’439’309’339 |

Why is the non-atomic case so much closer in performance? From the table, false sharing costs only about 1.5x for plain std::int64_t, versus about 7x for the atomic.


Here is the rest of the program, to complete the minimal reproducible example. (Also on Godbolt with MSVC, ready to compile and run.)

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

std::atomic<bool> start, stop;

void thd(type* var)
{
  while (!start) ;          // spin until main() releases all the threads
  while (!stop) (*var)++;   // hammer the counter until told to stop
}

int main()
{
  std::thread threads[] = {
     std::thread( thd, &sh.a ),     std::thread( thd, &sh.b ),
     std::thread( thd, &not_sh.a ), std::thread( thd, &not_sh.b ),
  };

  start.store(true);                                      // release all four threads

  std::this_thread::sleep_for(std::chrono::seconds(2));   // let them run

  stop.store(true);                                       // tell them to stop
  for (auto& thd : threads) thd.join();

  std::cout
    << " shared: "    << sh.a     << ' ' << sh.b     << '\n'
    << "not shared: " << not_sh.a << ' ' << not_sh.b << '\n';
}
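(To reproduce the numbers, build with optimization enabled, e.g. cl /O2 /EHsc with MSVC; the exact flags are a suggestion, not from the original post. An unoptimized build adds extra loads and stores in the loops and distorts the comparison.)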
— Alex Guteniev

1 Answer


A non-atomic memory increment can benefit from store forwarding when reloading its own stored value. This can happen even while the cache line is invalid: the core knows that the store will happen eventually, and the memory-ordering rules allow a core to see its own stores before they become globally visible.

Store forwarding gives you about as many increments as the store buffer has entries before you stall, instead of needing exclusive access to the cache line to do each atomic RMW increment.

When this core does eventually gain ownership of the cache line, it can commit stores to L1d at 1/clock. That is 6x faster than the dependency chain created by a memory-destination increment: ~5 cycles of store/reload (store-forwarding) latency plus 1 cycle of ALU latency. So in the non-atomic case, execution puts new stores into the store buffer at only 1/6th the rate they can drain while the core owns the line. This is why there isn't a huge gap between the shared and non-shared non-atomic cases.
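As a back-of-envelope check (assuming the ~5-cycle forwarding latency and 1 store/clock commit rate above, plus a store buffer of a few dozen entries; these are typical but microarchitecture-dependent figures):

// Rough arithmetic, not a measurement:
//   fill rate  : 1 increment per ~(5 + 1) cycles  =  1 store / 6 cycles
//   drain rate : 1 store commit / cycle (while the core owns the line)
// With, say, a ~50-entry store buffer, the core can keep incrementing for
// ~300 cycles after losing the line before it must stall, so losing the
// line to the other core costs far less than it does for lock xadd,
// which must own the line for every single increment.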

There are certainly going to be some memory-ordering machine clears, too; those and/or a full store buffer are the likely reasons for the lower throughput in the false-sharing case. See the answers and comments on What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? for another experiment somewhat like this one.


A lock inc or lock xadd, by contrast, forces the store buffer to drain before the operation, and includes committing to L1d cache as part of the operation. That makes store forwarding impossible, and the operation can only happen while the cache line is owned in the MESI Exclusive or Modified state.
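Note that weakening the memory order doesn't avoid this: on x86, any atomic RMW still needs the lock prefix and hence cache-line ownership. A minimal sketch (hypothetical function, reusing the question's stop flag and type alias):

// Hypothetical variant: relaxed ordering does not remove the lock prefix.
void thd_relaxed(std::atomic<std::int64_t>* var)
{
  while (!stop)
    var->fetch_add(1, std::memory_order_relaxed);  // still lock add / lock xadd on x86
}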


— Peter Cordes
  • Is the difference between shared and non-shared entirely due to the difference between accessing L1d directly and store forwarding? Or might context switches make a significant contribution too? (When a context switch away and back happens, I guess the store buffer has fully drained to RAM, but the cache line may still be intact.) – Alex Guteniev May 08 '20 at 11:41
  • @AlexGuteniev: The former; I assumed you were testing on a multi-core system that was idle enough that no context switches happened during the test, i.e. that all 4 threads had a core to themselves during the test and were running essentially uninterrupted most of the time. In the false-sharing cases, pinning both threads to one logical core would give a small / large speedup because having one thread asleep while the other runs removes false sharing. – Peter Cordes May 08 '20 at 11:47 (a thread-pinning sketch appears after this comment thread)
  • Sure, I ran this benchmark on an idle system. So I see there's almost no room for context switches to contribute: most of the time we expect no context switches, and in the pathological one-core cases there is no false sharing either. I can still imagine a situation where the threads are on different cores but each core switches to some other threads. Anyway, since we optimize for the case when resources are available, context switches should not be relevant. – Alex Guteniev May 08 '20 at 12:15
  • @AlexGuteniev: Yes, false sharing is basically only a problem when both threads are actively pounding on the same cache line, not when one is sleeping. – Peter Cordes May 08 '20 at 12:37
  • @Alex: Also, in your first comment, you said: *when context switch away and back happens, I guess store buffer has fully gone to RAM* - not quite. A context switch has to include a full barrier, or at least a release-store when saving context (so that if another core starts executing that thread, its loads will see *its own* stores). But that doesn't mean going to DRAM: coherent cache means that draining the SB to local L1d is sufficient, with write-back only on demand. If you meant RAM = coherent shared memory (the cache hierarchy), then yes; if you meant RAM = DRAM, as is common, then no. – Peter Cordes May 08 '20 at 12:38
  • Aha, so does having to handle context switching mean that a cache level that is shared between cores, yet is still not the actual DRAM, should always exist? – Alex Guteniev May 08 '20 at 12:49
  • @AlexGuteniev: No, having a shared last-level cache is not required, or even really relevant, to implementing cache coherency; see https://en.wikipedia.org/wiki/MESI_protocol which works for "sibling" caches. Of course you do want a shared cache as a backstop for coherency traffic, especially if your design can't send "dirty" data directly from one L1d to another ([like AMD can with MOESI](//stackoverflow.com/q/49983405)). And Intel uses the tags in its inclusive L3 as a snoop filter / directory to keep track of which core owns which line. – Peter Cordes May 08 '20 at 12:55
  • @AlexGuteniev: There are many reasons that having coherent caches is very, very useful; context switching is one of the least important. It would be trivial for the OS to do an explicit flush to make data visible to other cores, if one were needed on non-coherent memory. See also [Is mov + mfence safe on NUMA?](//stackoverflow.com/q/54652663) re: how universal ccNUMA is and what it would mean not to have it, and [When to use volatile with multi threading?](//stackoverflow.com/a/58535118) re: coherent caches. Without coherence, every atomic release-store would have to flush everything. – Peter Cordes May 08 '20 at 12:57
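
For completeness, here is a minimal sketch of the pinning experiment Peter Cordes describes above (Windows-specific; assumes MSVC's std::thread, whose native_handle() is a HANDLE; the helper name is invented):

#include <windows.h>
#include <thread>

// Hypothetical helper: restrict a thread to one logical core. With both
// false-sharing threads pinned to the same core, only one runs at a time,
// so the cache line stops ping-ponging between cores.
void pin_to_logical_core(std::thread& t, unsigned core)
{
  SetThreadAffinityMask(t.native_handle(), DWORD_PTR(1) << core);
}

// Usage (hypothetical): pin the two threads that share a line to core 0.
//   pin_to_logical_core(threads[0], 0);
//   pin_to_logical_core(threads[1], 0);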