I'm running a benchmark that measures the bandwidth of a network device (connected over PCIe). The benchmark uses a communication framework for transmitting the messages, and each send operation is represented by a send request.
Both sender and receiver have the same setup: the same CPU/RAM/network card/OS/software/drivers/etc. The test runs on a single core on each peer with only one thread (one on the sender side and one on the receiver side). The test has a warmup stage (executing but not measuring). Each process is pinned to a specific core chosen from the system topology (NUMA distances).
I've noticed that I'm getting different results on an AMD-based machine compared to an Intel-based machine. That in itself is expected, but the strange thing is that the AMD CPU (EPYC 7443) is supposed to be faster than the Intel one (E5-2680 v4), yet the Intel-based system gives 20% higher bandwidth.
I've counted the cache misses on both systems (on the sender) and got a much higher cache-miss rate on the AMD machine compared to the Intel machine. I'm not sure why that happens.
AMD:

```
 Performance counter stats for process id '1103957':

     4,064,775,312      cache-references                                      (41.65%)
       293,339,910      cache-misses          # 7.217 % of all cache refs     (41.65%)
     2,773,241,736      L1-dcache-load-misses                                 (41.65%)
         1,105,886      L1-icache-load-misses                                 (41.66%)
    97,199,735,971      cpu-cycles                                            (41.67%)
           138,263      r8ae                                                  (41.68%)
    60,775,752,753      r4ae                                                  (41.68%)
     2,044,760,199      r2ae                                                  (41.68%)
             2,012      r1ae                                                  (41.68%)
        16,147,477      r187                                                  (41.68%)
       179,557,410      r287                                                  (41.67%)
    76,295,394,278      r487                                                  (41.66%)

      25.000407397 seconds time elapsed
```
Intel:

```
 Performance counter stats for process id '109233':

     2,025,418,233      cache-references
           362,591      cache-misses          # 0.018 % of all cache refs
     2,967,129,952      L1-dcache-load-misses
        12,114,610      L1-icache-load-misses
    77,263,379,357      cycles
    26,532,945,498      resource_stalls.any
    10,038,121,055      resource_stalls.sb

      25.297898440 seconds time elapsed
```
The memory footprint is also different: Intel 626452K total vs. AMD 469760K total.
I've also measured, on the sender side, the lifetime (from creation to completion) of each request using `clock()`, and I got some strange results that I'm not sure how to interpret. I've attached a graph with 5000 sampled requests. It's probably worth mentioning that requests can go into a pending queue when the device is "out of resources", and this queue is implemented in software.
It seems that the CPUs work differently. I know there are architectural differences (the Intel CPU is monolithic, while the EPYC is chiplet-based), and probably other differences that could explain these strange results, but I'm not a CPU architect or a compiler expert, so I can't tell exactly what I'm seeing or why. Can you help me understand what might cause this and what can be improved? Also, I don't understand how a large fraction of the requests on the Intel machine can show a lifetime of 0 cycles: is it a measurement error?