
I'm running a benchmark that measures the bandwidth of a network device (connected via PCIe). The benchmark uses a communication framework for transmitting the messages. Each send operation is represented by a send request.

Benchmark code

Both sender and receiver have the same setup: same CPU/RAM/network card/OS/SW/drivers/etc. The test runs on a single core on each peer, with only one thread (one on the sender side and one on the receiver side). The test has a warmup stage (not measuring results, just executing). Each process is pinned to a specific core chosen from the system topology (NUMA distances).
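Not the benchmark code itself, but a minimal sketch of how a process can pin itself to one core with `sched_setaffinity` (core id `0` is just a placeholder; in the real setup the core is chosen from the NUMA topology, e.g. via `lstopo` / `numactl --hardware`):

```c
/* Minimal pinning sketch -- not the actual benchmark code.
 * Core id 0 is a placeholder; pick the core from the NUMA topology
 * (e.g. the NUMA node closest to the NIC). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                   /* placeholder core id */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) { /* pid 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the sender/receiver loop on this core ... */
    return 0;
}
```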

I've noticed that I'm getting different results on the AMD-based machine compared to the Intel-based machine. Some difference is expected, but the strange thing is that the AMD CPU (EPYC 7443) is supposed to be faster than the Intel (E5-2680 v4), yet the Intel-based system gives 20% higher bandwidth.

I've counted the cache misses on both systems (on the sender) and I see a much higher cache-miss rate on the AMD machine compared to the Intel machine. I'm not sure why that happens.
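For reference, a minimal sketch of reading the same `cache-misses` counter in-process via the `perf_event_open(2)` syscall; this is not what the benchmark does (the numbers below come from `perf` attached to the process id), just an equivalent way to collect the counter around the send loop:

```c
/* Sketch: count cache-misses around a region of code with perf_event_open.
 * The "send loop" is a placeholder for the code being measured. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* same event as perf's "cache-misses" */
    attr.disabled = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the send loop being measured ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) { perror("read"); }
    printf("cache-misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```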

AMD
Performance counter stats for process id '1103957':

      4,064,775,312      cache-references                                              (41.65%)
        293,339,910      cache-misses              #    7.217 % of all cache refs      (41.65%)
     2,773,241,736      L1-dcache-load-misses                                         (41.65%)
         1,105,886      L1-icache-load-misses                                         (41.66%)
    97,199,735,971      cpu-cycles                                                    (41.67%)
           138,263      r8ae                                                          (41.68%)
    60,775,752,753      r4ae                                                          (41.68%)
     2,044,760,199      r2ae                                                          (41.68%)
             2,012      r1ae                                                          (41.68%)
        16,147,477      r187                                                          (41.68%)
       179,557,410      r287                                                          (41.67%)
    76,295,394,278      r487                                                          (41.66%)

      25.000407397 seconds time elapsed

Intel
Performance counter stats for process id '109233':

      2,025,418,233      cache-references
            362,591      cache-misses              #    0.018 % of all cache refs
     2,967,129,952      L1-dcache-load-misses
        12,114,610      L1-icache-load-misses
    77,263,379,357      cycles
    26,532,945,498      resource_stalls.any
    10,038,121,055      resource_stalls.sb

      25.297898440 seconds time elapsed

The memory footprint is also different: Intel 626,452K total, AMD 469,760K total.

I've also measured, on the sender side, the lifetime (from creation to completion) of each request (using `clock()`), and I got some strange results that I'm not sure how to interpret. I've attached a graph with 5000 sampled requests. It is probably worth mentioning that requests can go into a pending queue when the device is "out of resources"; this queue is implemented in SW.
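For context, a minimal sketch of this kind of per-request timing with `clock()` (the request creation/completion calls are placeholders, not the framework's real API). One caveat that is probably relevant to the zero-lifetime samples mentioned below: `clock()` measures process CPU time in `CLOCKS_PER_SEC` units (1,000,000 per second per POSIX), and its effective resolution can be coarser than that, so any request that completes within one tick reads as 0:

```c
/* Sketch only: per-request lifetime measured with clock().
 * clock() returns process CPU time in CLOCKS_PER_SEC (1e6/s) units and may have
 * an even coarser effective resolution, so very short requests read as 0. */
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t start = clock();
    /* ... create one send request and wait for its completion (placeholder) ... */
    clock_t end = clock();
    printf("lifetime: %ld ticks (%.9f s)\n",
           (long)(end - start), (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```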

It seems that the CPUs work differently. I know there are differences in the CPU architectures (the Intel is monolithic), and probably other differences that could explain these strange results, but I'm not a CPU architect or compiler expert, so I can't tell exactly what I'm seeing or why I'm getting these results. Can you help me and explain what might cause this and what can be improved? Also, I don't understand how it's possible that a large number of requests on the Intel machine show a 0-cycle lifetime; is that a measurement error?

[Graph: requests lifetime in cycles (5000 sampled requests)]

  • Many details are missing to understand what is going on. Modern CPUs are very complex, so it is generally pretty hard to make correct wild guesses. The most important missing thing is the benchmark code: what does the code do exactly, and is it parallel? Then, how is it actually run: are threads pinned to cores, and what is the target hardware topology of the CPU and the network card? Besides, AFAIK, [`clock()`](https://en.cppreference.com/w/c/chrono/clock) is not reliable for parallel code and it is not even meant to compute wall-clock time. The result is certainly biased due to a threshold. – Jérôme Richard Mar 26 '23 at 13:10
  • I added the missing details (I think). Which method do you think I should use to measure the request lifetimes instead of `clock()`? – Brave Mar 26 '23 at 13:17
  • When it comes to "cache-misses", I am not sure what this means (i.e. which cache and in which context) nor whether it is comparable between the two architectures. I think it would be interesting to provide the L2 and L3 hits/misses. I find it weird that the number of cache-references is not the same (though it could just be due to something like a spin-lock). – Jérôme Richard Mar 26 '23 at 13:19
  • Thank you, `clock_gettime`+`CLOCK_MONOTONIC` should be relatively good in C, I guess. I personally use the steady clock in C++. If the measured time is very small, then using `rdtscp` is certainly the best option (see the `rdtscp` sketch after these comments), but you need to take care about the behaviour of this instruction on the target architecture, since it is not always the same (AFAIK, `rdtscp` provides frequency-independent timings on new Intel CPUs, but not on old ones, so frequency scaling can result in big biases; IDK for AMD). – Jérôme Richard Mar 26 '23 at 13:22
  • @JérômeRichard: The TSC is fixed-frequency on modern AMD as well, so it's also wall-clock time, not `cpu-cycles`. I know Intel Xeons can do DMA into L3 cache; if AMD can't also do that, an I/O bandwidth workload could suffer because of a round trip to DRAM. (Hmm, with AMD having multiple separate L3 domains, DMA would have to figure out which one to write into if they were going to implement this...) – Peter Cordes Mar 26 '23 at 15:36
  • @JérômeRichard regarding the cache-miss definition, you can find it here: https://stackoverflow.com/questions/12601474/what-are-perf-cache-events-meaning From this definition it seems that it's the LLC, but I'm not sure we can compare it between the machines. Regarding the lower cache references on the Intel, that actually makes sense if it shows better performance, don't you think? I will perform another measurement and update my post later, I hope. – Brave Mar 26 '23 at 18:56
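Since `rdtscp` came up in the comments, here is a minimal sketch of timing a very short interval with it (assuming x86-64 with an invariant TSC, i.e. `constant_tsc`/`nonstop_tsc` in /proc/cpuinfo; the delta is in TSC ticks, which track wall-clock time rather than core clock cycles when frequency scaling is active):

```c
/* Sketch: timing a very short interval with rdtscp (x86-64, GCC/Clang intrinsics).
 * Assumes an invariant TSC; the delta is in TSC ticks, not core cycles. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);   /* waits for prior instructions to finish before reading the TSC */
    /* ... the operation being timed, e.g. posting one send request (placeholder) ... */
    uint64_t t1 = __rdtscp(&aux);
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}
```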

1 Answer


I performed measurements using `clock_gettime` as @JérômeRichard advised and I got different results, so I think he is right and `clock()` is not the right way to do it.
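A minimal sketch of what the `clock_gettime`-based measurement looks like (the request creation/completion calls are placeholders, not the framework's real API): `CLOCK_MONOTONIC` gives wall-clock time with nanosecond resolution, which avoids the `clock()` issues discussed in the comments above.

```c
/* Sketch: per-request lifetime with clock_gettime(CLOCK_MONOTONIC). */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    uint64_t start = now_ns();
    /* ... create one send request and wait for its completion (placeholder) ... */
    uint64_t end = now_ns();
    printf("request lifetime: %llu ns\n", (unsigned long long)(end - start));
    return 0;
}
```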

Regarding what is causing the difference, I still don't have an exact answer, but I don't think the Intel is really doing a better job here... Because CPUs today are so complex, I decided to take a different approach: I forced the communication framework to use a different protocol on both test machines. The new protocol I chose actually uses the CPU more (more copies), so if the Intel were really faster you would expect to see it here as well, but in this case the AMD-based machine gave much better performance, not only compared to the Intel but also compared to the previous "fast" protocol...

Both protocols have more or less the same "function call footprint".

The previous protocol maps some memory region (only once) and shares it with the network device, so the framework doesn't need to perform a second copy from the data buffer to the device. The new protocol copies the buffer to the device memory.
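To make the difference concrete, here is a purely conceptual mock (all names are hypothetical, not the framework's real API) contrasting the two data paths: the zero-copy path only hands the device a descriptor pointing into the pre-registered region, while the copy-in path pays an extra `memcpy` per send:

```c
/* Conceptual mock only -- all names are hypothetical, not the framework's API.
 * It contrasts the two data paths described above. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MSG_SIZE 4096

static uint8_t registered_region[MSG_SIZE]; /* mapped/registered once, shared with the device */
static uint8_t staging_buffer[MSG_SIZE];    /* stands in for per-send device/bounce memory */

/* Zero-copy path: the payload already lives in the registered region,
 * so the "send" only posts a descriptor (here: just a pointer). */
static const uint8_t *send_zero_copy(void) {
    return registered_region;
}

/* Copy-in path: every send pays an extra memcpy into the staging buffer. */
static const uint8_t *send_with_copy(const uint8_t *payload) {
    memcpy(staging_buffer, payload, MSG_SIZE);
    return staging_buffer;
}

int main(void) {
    memset(registered_region, 0xAB, MSG_SIZE);
    const uint8_t *zc = send_zero_copy();
    const uint8_t *cp = send_with_copy(registered_region);
    printf("zero-copy descriptor: %p, copy-in descriptor: %p\n", (void *)zc, (void *)cp);
    return 0;
}
```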

Now I need to understand why the first protocol gives lower performance on AMD, but at least I know that the Intel is not really faster, better at using the cache, etc.

  • Introducing a bunch of extra copying work shows that a modern AMD can do that faster, no surprise there with their fast per-CCX L3 caches, and probably wider paths between L1d and L2 than Broadwell, as well as larger L2 caches. But that extra copying work wasn't part of the original benchmark. If you just time that more accurately, with higher-resolution clocks, that could be interesting if it turns out that an old Intel can do better with DMA into L3 cache. – Peter Cordes Mar 28 '23 at 19:28
  • There are many factors that need to be considered. I'm still not sure that the CPU alone is responsible for the difference I saw before. I will investigate why it happens and will update my answer if I find the exact reason. Do you have some reason to believe that an old Intel can do DMA to L3 faster, besides these results? – Brave Mar 28 '23 at 19:32
  • DMA into L3 has been a feature Intel has advertised in their Xeons for years, not just Skylake-X (Scalable Xeon) when they changed to a non-inclusive L3 with a mesh interconnect. It is sometimes called DCA (Direct Cache Access), and a more recent article calls it DDIO in the context of Xeon Scalable, which has a different uncore architecture (https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/effective-utilization-of-intel-ddio-technology.html). https://dl.acm.org/doi/pdf/10.1145/3508042 is an article about reverse-engineering the details in Broadwell and Skylake Xeons. – Peter Cordes Mar 28 '23 at 20:46
  • Hrm, or maybe DCA was an old thing and DDIO is the new thing? Someone who says they implemented a NIC driver using it (https://news.ycombinator.com/item?id=12778257) makes a distinction in naming, although that 2022 paper describes different terminology. Anyway, I'm certain that Intel Xeons for some years before Broadwell can do DMA into L3 cache, whatever they call it and whatever the fine details are, and that this can be important for 10GbE and 100GbE Ethernet, and other high-bandwidth I/O. I don't know if AMD has a similar capability. – Peter Cordes Mar 28 '23 at 20:50
  • I'm not sure the way the PCI card accesses the data is the only reason for the performance difference. I ran a similar benchmark that calls the communication framework directly (not through Open MPI) and I got similar performance to the Intel-based system. I performed another measurement with perf and I noticed that `send_func@plt` takes a big chunk of the CPU running time on the AMD machine. Is it noise? – Brave Apr 13 '23 at 16:30
  • I have no way of knowing, with no details about what your code does or how your PCI hardware works. You might want to confirm that most PCI transfers are a full 64 bytes at a time; there should be performance counters for that. – Peter Cordes Apr 13 '23 at 20:29
  • I found this article: https://dl.acm.org/doi/pdf/10.1145/3508042 Maybe what you mentioned before about DDIO is the reason for the performance difference. According to this article (March '22), Intel is the only vendor that supports DDIO, and it can have a big performance impact when using a zero-copy (Zcopy) protocol. – Brave Apr 24 '23 at 12:44
  • That's the same article I linked in a previous comment. They say that DDIO is "one commercial implementation of DCA". So yes, Intel is the only vendor with DDIO, but only because other vendors use different names for DMA into L3. (I don't know whether AMD CPUs have a similar feature.) – Peter Cordes Apr 24 '23 at 12:54