
Assume that we have multiple threads accessing the same cache line in parallel. One of them (the writer) repeatedly writes to that cache line and also reads from it. The other threads (the readers) only read from it, repeatedly. I understand that the readers suffer a performance hit: with the invalidation-based MESI protocol, the readers' copies of the line must be invalidated before the writer's store can complete, and that happens frequently here. But I would expect the writer's own read of that cache line to be fast, because a local write does not invalidate the writer's own copy.

However, the strange thing is that when I run such an experiment on a dual-socket machine with two Intel Xeon Gold 5220R processors (24 cores each, 2.20 GHz), the writer's read of that cache line becomes the performance bottleneck.

This is my test program (compiled with gcc 8.4.0, -O2):

#define _GNU_SOURCE

#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

#define CACHELINE_SIZE  64

volatile struct {
    /* cacheline 1 */
    size_t x, y;
    char padding[CACHELINE_SIZE - 2 * sizeof(size_t)];
    /* cacheline 2 */
    size_t p, q;
} __attribute__((aligned(CACHELINE_SIZE))) data;

static inline void bind_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if ((sched_setaffinity(0, sizeof(cpu_set_t), &mask)) != 0) {
        perror("bind core failed");
    }
}

#define smp_mb()    asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

void *writer_work(void *arg) {
    long id = (long) arg;
    int i;
    bind_core(id);
    printf("writer tid: %ld\n", syscall(SYS_gettid));
    while (1) {
        /* read after write */
        data.x = 1;
        data.y;
        for (i = 0; i < 50; i++) __asm__("nop"); // to highlight bottleneck
    }
}

void *reader_work(void *arg) {
    long id = (long) arg;
    bind_core(id);
    while (1) {
        /* read */
        data.y;
    }
}

#define NR_THREAD   48

int main() {
    pthread_t threads[NR_THREAD];
    int i;
    printf("%p %p\n", &data.x, &data.p);
    data.x = data.y = data.p = data.q = 0;
    pthread_create(&threads[0], NULL, writer_work, 0);
    for (i = 1; i < NR_THREAD; i++) {
        pthread_create(&threads[i], NULL, reader_work, i);
    }
    for (i = 0; i < NR_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
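
For reference, the build step and the perf commands mentioned below expand to roughly the following (the source/binary names are arbitrary, and <writer_tid> is the tid printed by the writer):

gcc -O2 -pthread -o contend contend.c
./contend &
# attach to the writer thread for ~10 seconds, then annotate
perf record -e cycles -t <writer_tid> -- sleep 10
perf annotate --stdio writer_work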

I use perf record -t <tid> to collect the cycles event for the writer thread and perf annotate writer_work to show the per-instruction breakdown inside writer_work:

         :          while (1) {
         :              /* read after write */
         :              data.x = 1;
    0.20 :        a50:   movq   $0x1,0x200625(%rip)        # 201080 <data>
         :              data.y;
   94.40 :        a5b:   mov    0x200626(%rip),%rax        # 201088 <data+0x8>
    0.03 :        a62:   mov    $0x32,%eax
    0.00 :        a67:   nopw   0x0(%rax,%rax,1)
         :              for (i = 0; i < 50; i++) __asm__("nop");
    0.03 :        a70:   nop
    0.03 :        a71:   sub    $0x1,%eax
    5.17 :        a74:   jne    a70 <writer_work+0x50>
    0.15 :        a76:   jmp    a50 <writer_work+0x30>

It seems that the load of data.y is the bottleneck.

First, I think the performance degradation is related to the cache line modification: when I comment out the writer's store, perf no longer blames the writer's read. Second, it has something to do with the concurrent readers: when I comment out the readers' loads, the writer's read is not blamed either. Third, the effect is tied to the contended cache line itself: when I change the writer's read of data.y to a read of data.p, which loads from the other cache line, the count on the load decreases.
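
For clarity, the variants I compared replace the writer's loop body in the program above roughly as follows (the nop loop is kept in all variants; data.p is just a field on the second, uncontended cache line):

/* baseline: store + reload of the contended line -> the load gets ~94% of the cycles */
data.x = 1;
data.y;

/* variant 1: store commented out -> the load is no longer blamed */
data.y;

/* variant 2: the readers' loop body emptied (writer unchanged) -> the load is no longer blamed */
data.x = 1;
data.y;

/* variant 3: reload moved to the other cache line -> the count on the load decreases */
data.x = 1;
data.p;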

There is another possibility, suggested by Peter Cordes: that it's whatever instruction follows the store that gets the blame. However, when I move the nop loop to before the load, perf still blames the load instruction:

         :      writer_work():
         :          while (1) {
         :              /* read after write */
         :              data.x = 1;
    0.00 :        a50:   movq   $0x1,0x200625(%rip)        # 201080 <data>
    0.03 :        a5b:   mov    $0x32,%eax
         :              for (i = 0; i < 50; i++) __asm__("nop");
    0.03 :        a60:   nop
    0.09 :        a61:   sub    $0x1,%eax
    6.24 :        a64:   jne    a60 <writer_work+0x40>
         :              data.y;
   93.60 :        a66:   mov    0x20061b(%rip),%rax        # 201088 <data+0x8>
         :              data.x = 1;
    0.02 :        a6d:   jmp    a50 <writer_work+0x30>

So my question is: what causes the performance degradation when the writer reads a locally modified cache line while concurrent readers are accessing it? Any help would be appreciated!
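
In case it is useful, this is roughly the command I used to check the counters suggested in the comments below (exact event names can differ between perf/CPU versions):

perf stat -e machine_clears.memory_ordering,mem_load_retired.l1_miss,l1d.replacement -t <writer_tid> -- sleep 10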

hhusjr
  • It's probably the store that's actually causing the slowdown (having to Read For Ownership (RFO) to get exclusive access again after transitioning to MESI Shared state in response to read requests). It's interesting that the reload gets the blame for the `cycles` event; I wonder if that's coincidence. Perhaps the store buffer and then load buffer entries filling up preventing new loads from allocating? But usually the CPU blames an instruction that's one of the oldest not-yet-retired instructions. – Peter Cordes Jul 31 '22 at 19:13
  • Are you getting any counts from `machine_clears.memory_ordering`? I'd guess probably not, since the writing core has a valid copy of the line the whole time, I expect. – Peter Cordes Jul 31 '22 at 19:14
  • The instruction that sees all the counts is the instruction right after a store. Alloc/rename of stores are going to bottleneck on the store buffer being full. (But any store that's close to retirement must already be in the back-end, and thus already have a store-buffer entry allocated for it.) I'd suggest changing the asm (by hand; edit the `gcc -S` output then assemble that `.s`) to put a couple `nop`s there, and see if it's actually whatever instruction follows the store that's getting the blame. – Peter Cordes Jul 31 '22 at 19:18
  • I got no counts from `machine_clears.memory_ordering`. I also put some `nop`s after the store, but `perf` still blames the load instruction (see the edited question). – hhusjr Aug 01 '22 at 02:20
  • Thanks for testing those guesses, that narrows down the possible reasons for that specific instruction getting the blame for the extra slowness from having to wait for RFOs before stores can commit. (It is fully expected that the writer would slow down when there's a reader, since it will have to lose *exclusive* ownership of the cache line. At least on Intel CPUs that use MESIF; possibly a MOESI coherency system like AMD uses could be different, allowing the writer to commit stores and broadcast updates, [if wikipedia is actually right about that](//en.wikipedia.org/wiki/MOESI_protocol).) – Peter Cordes Aug 01 '22 at 02:27
  • I wonder if all the other loads/stores introduced by `-O0` anti-optimization are much of a factor. Could be interesting to compare a version built with `-O2`, with `std::atomic` with `memory_order_relaxed`. Also I hadn't realized earlier you were reloading a *different* part of the cache line, not store-forwarding from the recent store. IDK if that would matter; the writer should still have a readable copy of the cache line at all times, either Modified or Shared state. But perhaps an experiment worth trying. (Use `volatile` or `atomic` relaxed to force store/reload on same obj) – Peter Cordes Aug 01 '22 at 02:56
  • Thanks for the analysis! I modified the program using `volatile`+`O2` (see the edited question). Without additional loads/stores, the previous results still hold true. Maybe the slowdown is mainly related to the RFO as you said since when I change `data.y` to `data.x`, which hits the store buffer, the count decreased from 90%+ to 6%. But why does the load need to wait for the store to commit? To prevent re-order with previous store to the same cache line? But I learnt from https://stackoverflow.com/q/35830641/18134170 that x86 only guarantees order for store and the load to exact same location – hhusjr Aug 01 '22 at 16:54
  • The load doesn't need to wait for the store to *commit* (to L1d and become globally visible), only execute if it's reloading it. That's required for simple single-threaded correctness. That statement in Intel's manual that "However, loads are not reordered with stores to the same location." is a *lot* less interesting than it looks. As my answer on that Q&A points out, if it meant commit, that would make store/reload into a memory barrier, but it isn't. – Peter Cordes Aug 01 '22 at 18:51
  • So reloading the store changes the distribution of which instructions get blamed for cycles. But does it actually run faster overall, though? If so, that might indicate that the writer core is having its copy of the cache line invalidated sometimes, instead of just flip to Shared state (read only) which would still let loads read from it. Perhaps counters for `mem_load_retired.l1_miss` could shed some light. Or `l1d.replacement`, although IDK if that would count just invalidation without actual replacement. – Peter Cordes Aug 01 '22 at 18:56
  • A 50 iteration `nop` loop seems weird. The compiler didn't even unroll it so you have an inner loop that probably mispredicts on the final iteration. But even with that, still getting most of the cycles spent on that load seems to indicate there's something funny going on with it. – Peter Cordes Aug 01 '22 at 19:01

0 Answers