Assume that multiple threads access the same cache line in parallel. One of them (the writer) repeatedly writes to that cache line and also reads from it; the other threads (the readers) only repeatedly read from it. It is clear to me that the readers suffer performance degradation, since under the invalidation-based MESI protocol each of the writer's stores invalidates the readers' cached copies, and those stores happen frequently. But I would expect the writer's read of that cache line to be fast, because a local write does not invalidate the writer's own copy.
However, the strange thing is that when I run such an experiment on a dual-socket machine with two Intel Xeon Gold 5220R processors (24 cores each, 2.20 GHz), the writer's read of that cache line becomes a performance bottleneck.
This is my test program (compiled with gcc 8.4.0, -O2):
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#define CACHELINE_SIZE 64
volatile struct {
    /* cache line 1 */
    size_t x, y;
    char padding[CACHELINE_SIZE - 2 * sizeof(size_t)];
    /* cache line 2 */
    size_t p, q;
} __attribute__((aligned(CACHELINE_SIZE))) data;
static inline void bind_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) != 0) {
        perror("bind core failed"); /* perror appends its own message */
    }
}
#define smp_mb() asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
void *writer_work(void *arg) {
    long id = (long) arg;
    int i;
    bind_core(id);
    printf("writer tid: %ld\n", syscall(SYS_gettid));
    while (1) {
        /* read after write */
        data.x = 1;
        data.y;
        for (i = 0; i < 50; i++) __asm__("nop"); /* to highlight the bottleneck */
    }
}
void *reader_work(void *arg) {
    long id = (long) arg;
    bind_core(id);
    while (1) {
        /* read */
        data.y;
    }
}
#define NR_THREAD 48
int main() {
    pthread_t threads[NR_THREAD];
    long i;
    printf("%p %p\n", (void *) &data.x, (void *) &data.p);
    data.x = data.y = data.p = data.q = 0;
    pthread_create(&threads[0], NULL, writer_work, (void *) 0);
    for (i = 1; i < NR_THREAD; i++) {
        pthread_create(&threads[i], NULL, reader_work, (void *) i);
    }
    for (i = 0; i < NR_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
I use perf record -t <tid> to collect the cycles event for the writer thread, and perf annotate writer_work to show the per-instruction breakdown inside writer_work:
: while (1) {
: /* read after write */
: data.x = 1;
0.20 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
: data.y;
94.40 : a5b: mov 0x200626(%rip),%rax # 201088 <data+0x8>
0.03 : a62: mov $0x32,%eax
0.00 : a67: nopw 0x0(%rax,%rax,1)
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a70: nop
0.03 : a71: sub $0x1,%eax
5.17 : a74: jne a70 <writer_work+0x50>
0.15 : a76: jmp a50 <writer_work+0x30>
It seems that the load instruction of data.y becomes the bottleneck.
First, I think the degradation is related to the cache line being modified, because when I comment out the writer's store, the perf result no longer shows the writer's read as the bottleneck. Second, it must have something to do with the concurrent readers, because when I comment out the readers' loads, the writer's read is no longer blamed by perf. It is specifically the readers touching the same cache line that slows the writer down: when I change the writer's load to read data.p instead of data.y, i.e. a load from another cache line, the reported count decreases.
There is another possibility, suggested by Peter Cordes: it could be whatever instruction follows the store that gets the blame. However, when I move the nop loop before the load, perf still blames the load instruction:
: writer_work():
: while (1) {
: /* read after write */
: data.x = 1;
0.00 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
0.03 : a5b: mov $0x32,%eax
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a60: nop
0.09 : a61: sub $0x1,%eax
6.24 : a64: jne a60 <writer_work+0x40>
: data.y;
93.60 : a66: mov 0x20061b(%rip),%rax # 201088 <data+0x8>
: data.x = 1;
0.02 : a6d: jmp a50 <writer_work+0x30>
So my question is: what causes the performance degradation when the writer reads a cache line it has just modified itself while concurrent readers are accessing it? Any help would be appreciated!