
(I know that there have been a few somewhat related questions asked in the past, but I wasn't able to find one specifically about L1d cache misses and Hyper-Threading/SMT.)

After a couple of days of reading about some super interesting things like false sharing and the MESI/MOESI cache coherence protocols, I decided to write a small "benchmark" in C (see below) to see false sharing in action.

I basically have an array of 8 doubles, so that it fits in one cache line, and two threads incrementing adjacent array positions.

At this point I should state that I am using a Ryzen 5 3600, whose topology can be seen here.

I create two threads, pin them to two different logical cores, and each accesses and updates its own array position, e.g. Thread A updates array[2] and Thread B updates array[3].

When I run the code using hardware threads #0 and #6, which belong to the same core and (as shown in the topology diagram) share the L1d cache, the execution time is ~5 seconds.

When I use threads #0 and #11, which don't have any caches in common, it takes ~9.5 seconds to complete. That time difference is expected, because in this case there is "cache line ping-pong" going on.

However, and this is what bugs me, when I am using threads #0 and #11, there are fewer L1d cache misses than when running with threads #0 and #6.

My guess is that when using threads #0 and #11, which don't have common caches, whenever one thread updates the contents of the shared cache line, the copy in the other core gets invalidated according to the MESI/MOESI protocol. So even though there's ping-pong going on, there aren't that many cache misses (compared to running with threads #0 and #6), just a bunch of invalidations and cache line transfers between the cores.

So, when using threads #0 and #6, which share an L1d cache, why are there more cache misses?

(Threads #0 and #6 also share the L2 cache, but I don't think that matters here, because when the cache line gets invalidated it has to be fetched either from main memory (MESI) or from another core's cache (MOESI), so it seems impossible for the L2 to even have the needed data, let alone be asked for it.)

Of course, when one thread writes into the L1d cache line, the line gets "dirty", but why does that matter? Shouldn't the other thread residing on the same physical core have no problem reading the new "dirty" value?

TL;DR: When testing false sharing, there are approximately 3x more L1d cache misses when using two sibling threads (threads that belong to the same physical core) than when using threads that belong to two different physical cores (2.34% vs. 0.75% miss rate, 396M vs. 118M misses in absolute numbers). Why is that happening?

(All statistics, like the L1d cache misses, are measured using the perf tool on Linux.)

Also, a minor secondary question: why are the sibling threads paired with IDs 6 apart, i.e. thread 0's sibling is thread 6 and thread i's sibling is thread i+6? Does that help in any way? I've noticed this on both Intel and AMD CPUs.
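
A quick way to check the pairing on a given machine is to read the sysfs topology files; something like this minimal sketch loops over /sys/devices/system/cpu/cpuN/topology/thread_siblings_list for each logical CPU and prints the sibling list (it prints 0,6 for cpu0 on my machine):

#include <stdio.h>

int main(void)
{
    char path[128], line[64];
    // walk cpu0, cpu1, ... until a topology file no longer exists
    for (int cpu = 0; ; cpu++)
    {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;
        if (fgets(line, sizeof(line), f))
            printf("cpu%d siblings: %s", cpu, line);   // e.g. "0,6" for cpu0 on my Ryzen
        fclose(f);
    }
    return 0;
}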

I am super interested in computer architecture and I am still learning, so some of the above might be wrong; sorry about that.

So, this is my code: just creating two threads, binding them to specific logical cores, and then hitting adjacent positions in the same cache line.

#define _GNU_SOURCE

#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/random.h>
#include <time.h>
#include <pthread.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

struct timespec tstart, tend;
static cpu_set_t cpuset;


typedef struct arg_s
{
       int index;
       double *array_ptr;
} arg_t;

void *work(void *arg)
{
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    int array_index = ((arg_t*)arg)->index;
    double *ptr = ((arg_t*)arg)->array_ptr;

    for(unsigned long i=0; i<1000000000; i++)
    {
            // it doesn't matter which of these is used,
            // as long as we are hitting adjacent positions
            ptr[array_index] ++;
            // ptr[array_index] += 1.0e5 * 4;
    }
    return NULL;
}

int main()
{
    pthread_t tid[2];

    srand48(time(NULL));   // seed drand48(), which initializes the array below
    
    static int cpu0 = 0;
    static int cpu6 = 6; //change this to say 11 to run with threads 0 and 11

    CPU_ZERO(&cpuset);
    CPU_SET(cpu0, &cpuset);
    CPU_SET(cpu6, &cpuset);

    double array[8];

    for(int i=0; i<8; i++)
            array[i] = drand48();

    arg_t *arg0 = malloc(sizeof(arg_t));
    arg_t *arg1 = malloc(sizeof(arg_t));

    arg0->index = 0; arg0->array_ptr = array;       
    arg1->index = 1; arg1->array_ptr = array;


    clock_gettime(CLOCK_REALTIME, &tstart);

    pthread_create(&tid[0], NULL, work, (void*)arg0);
    pthread_create(&tid[1], NULL, work, (void*)arg1);

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);


    clock_gettime(CLOCK_REALTIME, &tend);
 }

I am using GCC 10.2.0, compiling with `gcc -pthread p.c -o p`.

Then I run `perf record ./p --cpu=0,6`, or the same thing with `--cpu=0,11`, when using threads 0,6 and 0,11 respectively.

Then I run `perf stat -d ./p --cpu=0,6`, or the same with `--cpu=0,11` for the other case.

Running with Threads 0 and 6:

Performance counter stats for './p --cpu=0,6':

           9437,29 msec task-clock                #    1,997 CPUs utilized          
                64      context-switches          #    0,007 K/sec                  
                 2      cpu-migrations            #    0,000 K/sec                  
               912      page-faults               #    0,097 K/sec                  
       39569031046      cycles                    #    4,193 GHz                      (75,00%)
        5925158870      stalled-cycles-frontend   #   14,97% frontend cycles idle     (75,00%)
        2300826705      stalled-cycles-backend    #    5,81% backend cycles idle      (75,00%)
       24052237511      instructions              #    0,61  insn per cycle         
                                                  #    0,25  stalled cycles per insn  (75,00%)
        2010923861      branches                  #  213,083 M/sec                    (75,00%)
            357725      branch-misses             #    0,02% of all branches          (75,03%)
       16930828846      L1-dcache-loads           # 1794,034 M/sec                    (74,99%)
         396121055      L1-dcache-load-misses     #    2,34% of all L1-dcache accesses  (74,96%)
   <not supported>     LLC-loads                                                   
   <not supported>     LLC-load-misses                                             

       4,725786281 seconds time elapsed

       9,429749000 seconds user
       0,000000000 seconds sys 

Running with Threads 0 and 11:

 Performance counter stats for './p --cpu=0,11':

          18693,31 msec task-clock                #    1,982 CPUs utilized          
               114      context-switches          #    0,006 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
               903      page-faults               #    0,048 K/sec                  
       78404951347      cycles                    #    4,194 GHz                      (74,97%)
        1763001213      stalled-cycles-frontend   #    2,25% frontend cycles idle     (74,98%)
       71054052070      stalled-cycles-backend    #   90,62% backend cycles idle      (74,98%)
       24055983565      instructions              #    0,31  insn per cycle         
                                                  #    2,95  stalled cycles per insn  (74,97%)
        2012326306      branches                  #  107,650 M/sec                    (74,96%)
            553278      branch-misses             #    0,03% of all branches          (75,07%)
       15715489973      L1-dcache-loads           #  840,701 M/sec                    (75,09%)
         118455010      L1-dcache-load-misses     #    0,75% of all L1-dcache accesses  (74,98%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

       9,430223356 seconds time elapsed

      18,675328000 seconds user
       0,000000000 seconds sys
  • Maybe related: [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/q/45602699) – Peter Cordes Apr 19 '21 at 23:01
  • *paired in IDs 6 numbers apart?* - It's not a fixed 6, it's enumerating 1 core of each physical core, then coming back and doing the other. So using the low half of the core numbers is one thread per physical core. I don't know whether / how the BIOS influences how different kernels enumerate cores, but on some systems both logical cores of each physical core are enumerated by Linux before moving on, vs. this layout on other systems. Both are equally valid. – Peter Cordes Apr 20 '21 at 02:21
  • Can you include a [mcve] test case to see exactly what operations you're doing? – Peter Cordes Apr 20 '21 at 02:34
  • You said there are 2% more misses but your numbers show way more: 396m vs 118m? – BeeOnRope Apr 20 '21 at 16:49
  • @BeeOnRope Oops, just saw that... you are right. I got confused and typed the 2% that's written in the output of the second perf stat. Still, what I am interested in finding out is why there are more cache misses when using sibling threads. – Zeosleus Apr 20 '21 at 21:46
  • Why is your `main` messing around with `pthread_setaffinity_np` when you're already using `taskset`? Your perf results show that both runs utilized almost exactly 2.0 CPUs, so I assume you changed the source between runs, otherwise it would have overridden the initial affinity inherited from the parent process. Anyway, looks like over-complication that you could totally remove to make your MCVE more minimal. (And you forgot to `#include` headers so your example isn't complete either; I was going to try it on my Intel desktop but it doesn't compile.) – Peter Cordes Apr 21 '21 at 02:51
  • Note that `perf stat` doesn't care if you've run `perf record` or not; it runs the program, not parsing perf.data. And after including headers, it compiles but crashes while exiting with a stack-check failure (`*** stack smashing detected ***: terminated` at the bottom of main, after `pthread_join`). Oh, that's from your loop that writes array[0..15] after allocating `double array[8]`. No idea what the point of that is when you actually malloc space to pass to your threads. – Peter Cordes Apr 21 '21 at 02:55
  • On quad-core Skylake with `perf stat --all-user` I get 0.00% L1d misses when both threads are on the same physical core (as expected). On different cores, 2.19%. I guess the store buffer + store-forwarding is able to batch up some commits so it's not a total performance disaster. The only mystery in your results is why you see *any* L1d misses on the same core with false sharing. Maybe those are kernel accesses. – Peter Cordes Apr 21 '21 at 03:07
  • Oh, and your results show *more* misses when sharing a core, if that's truly accurate. (I removed the CPU_SET stuff from the source and just used taskset -c ; for your test results, the command line args are meaningless if you still have the source overriding affinity.) – Peter Cordes Apr 21 '21 at 03:10
  • @PeterCordes Excuse my ignorance, it's the first time I am messing around with thread affinities, but how is main messing around with `pthread_setaffinity_np`? Main is just initializing a `cpuset`, which is a global variable, so the pthreads can access it for their call to `pthread_setaffinity_np`. Also, how am I already using `taskset`? I haven't changed the source at all, except when I want to run with two independent threads like #0 and #11: as mentioned in a comment inside `main`, I just change the value of one of the `static int cpu` variables to be, say, 11 while the other one stays 0. – Zeosleus Apr 21 '21 at 08:12
  • When you say almost 2.0 CPUs were used, do you mean 2 physical processors? Because when I run the program and observe which logical cores are used via htop, I only see two of them used, which is what I wanted. (I also included the headers now, sorry for skipping them before.) Just saw the loop that causes the stack smashing: it was running 16 times because when I first tested the program I was using `ints`, not doubles, so I needed 16 ints to fill up one cache line. My results indeed show **more** misses when sharing a core, which is why I originally posted this question. – Zeosleus Apr 21 '21 at 08:18
  • I removed the `CPU_SET` stuff and tried it with `taskset`, but the cache misses were exactly the same. So which command line args are you referring to that are being overridden by the source code? For example, the `--cpu=0,6` that's being passed in the `perf stat` invocation? You mentioned that when using two sibling threads it's a mystery why I am getting cache misses. So the expected behavior in this case would be to have 0% cache misses, like you said you observed on your machine? I am going to try the program on another CPU and observe the results. – Zeosleus Apr 21 '21 at 08:20
  • Yes, the `taskset --cpu=0,6 perf stat ...` would I think be overridden by your CPU_SET / `pthread_setaffinity_np` stuff, unless I'm misunderstanding how the system call works. Are you sure the sharing-a-physical-core case is the one with more cache misses? – Peter Cordes Apr 21 '21 at 09:01
  • For "almost 2.0 CPUs used", I was talking about "`task-clock # 1,982 CPUs utilized`". That's showing you CPU-time / wall-time, so it tells you that both threads must have been running on different logical cores to both use CPU time at once. It proves that CPU_SET and taskset -c didn't do an intersection of the allowed cores, because that would have left both threads only allowed to run on logical core #0. (In hindsight I don't think that was a real possibility, but since I think you must have been changing your source between runs it's still a useful sanity check.) – Peter Cordes Apr 21 '21 at 09:09
  • You don't need to init the whole array, only the 2 elements you actually use. And you could have just left it zero-initialized like `alignas(64) double array[8] = {0};`. You didn't align the array so your only guarantee (per the x86-64 System V ABI) is that it's 16-byte aligned. The whole arg-passing mechanism also seems pretty over-complicated; you could just pass each thread one `double *` to the array element you want it to increment. Instead of a pointer to a struct in dynamically-allocated storage. – Peter Cordes Apr 21 '21 at 09:12
  • @PeterCordes I ran the program on another machine that has an Intel i3-6100 CPU and the [results](https://imgur.com/a/AxxUjIM) were like the ones you saw yourself... (Sorry for providing the data in images, but right now I am away from that machine.) That's definitely strange. – Zeosleus Apr 21 '21 at 11:33
  • https://godbolt.org/z/x7TKfsE55 is a simplified version of your code that doesn't need the source changed to run with different `taskset --cpu=` options, or do any other over-complicated stuff. (And it uses `volatile double*` so you can compile with optimization without removing the memory access.) I'd recommend updating the code in your question to that. – Peter Cordes Apr 21 '21 at 18:33
  • The Intel results make perfect sense to me: the line stays hot in L1d cache of the physical core that both threads are sharing, so no cache misses. Otherwise it has to ping-pong. The AMD results are the weird thing. Makes me wonder if there's a same/different CCX effect going on, and whether you're ever *really* running on the same physical core at all. Make sure you check how your own system / kernel enumerates your cores, if that graphic wasn't generated from your own system. – Peter Cordes Apr 21 '21 at 18:35
  • When I was first writing the program, I was expecting results like the Intel ones: no cache misses when using two sibling threads. But I am indeed running on the same physical core, because I've double-checked it in the sysfs pseudo-filesystem. When I examined the contents of the `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list` file, I got `0,6`, which confirms that Thread **0** is a sibling of Thread **6**. I really can't understand what's happening in my AMD CPU. – Zeosleus Apr 21 '21 at 18:58
  • Also, I always check which virtual cores are running with `htop`, and when I intend the program to run with Threads 0 and 6, I indeed observe the virtual cores 0 and 6 going to 100% usage. – Zeosleus Apr 21 '21 at 19:09
  • I had a friend who has an Intel Core i7-8750H run the program, and guess [what](https://imgur.com/a/pdk36aN) :P So it seems to be an Intel vs. AMD thing... – Zeosleus Apr 21 '21 at 20:20
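
Putting together the simplifications Peter Cordes suggests above (a cache-line-aligned array, one `volatile double *` per thread so the increment survives optimization, and pinning left entirely to `taskset` instead of `pthread_setaffinity_np`), a stripped-down version of the test could look roughly like the following sketch (a reconstruction along those lines, not the exact code behind the Godbolt link):

#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

// 8 doubles = 64 bytes, aligned so both elements really share one cache line
static alignas(64) double array[8];

static void *work(void *arg)
{
    volatile double *slot = arg;   // volatile keeps the memory access even with -O2
    for (unsigned long i = 0; i < 1000000000UL; i++)
        *slot += 1.0;
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    // No affinity calls here: pinning is done from the outside, e.g.
    //   taskset -c 0,6  perf stat -d ./p     (sibling hardware threads)
    //   taskset -c 0,11 perf stat -d ./p     (different physical cores)
    pthread_create(&tid[0], NULL, work, &array[0]);
    pthread_create(&tid[1], NULL, work, &array[1]);
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);
    printf("%f %f\n", array[0], array[1]);
    return 0;
}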

1 Answer


Perf does not yet fully support AMD for detailed analysis of cache behavior (it partially does from kernel 6.1 onwards). You can use AMD's uProf to investigate in more detail: https://www.amd.com/en/developer/uprof.html

  • Are you saying the events `perf` does support give wrong counts, or that the names are mapped to the wrong HW events so it's actually counting something else? Because it does report counts for `L1-dcache-loads` and `L1-dcache-load-misses` on the OP's CPU. – Peter Cordes Apr 16 '23 at 07:51
  • No, I am not saying that perf does not count or that it gives wrong results, but that it does not support other metrics like the 'LLC-loads / LLC-load-misses' you see in the OP's post; hence I was suggesting using another tool to get more insight for the analysis. – rjhcnf Apr 16 '23 at 07:55
  • Fair enough, useful info, but that doesn't really answer this specific question, unfortunately. The OP was primarily interested in the L1d counts, and used `perf stat -d` to get them, which also enables LLC counts. The actual question is about L1d miss counts in false sharing between logical cores on the same or different physical cores. LLC and L2 counts could perhaps figure out where memory traffic is going, but it's still weird that there'd be L1d misses at all in the case of sharing a physical core. – Peter Cordes Apr 16 '23 at 08:09
  • In a single multi-threaded core, each thread needs to maintain coherence with the other thread(s); on this specific AMD hardware, that overhead is probably bigger than maintaining coherence with thread(s) on another core. – rjhcnf Apr 16 '23 at 10:09
  • Yeah, on Intel CPUs you get lots of `machine_clears.memory_ordering`, like 360M vs. 40M, as discussed in comments on another Q&A: [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/posts/comments/78210324) So it's probably slower on AMD as well, but why would that involve any L1d misses? – Peter Cordes Apr 16 '23 at 10:28
  • It is a matter of what counts as an L1d miss in the PMCs of the architecture and in the corresponding perf implementation. As you see, the execution time is smaller in the same-core case, so maybe that's what is important. – rjhcnf Apr 16 '23 at 11:17