Comparing application performance between CPU architectures

Question

I have a Java Servlet-based application running on Apache Tomcat on two different machines with similar hardware (RAM, SSD disk, network interface and bandwidth) but different CPU architectures:

x86_64

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping:            7
CPU MHz:             3000.000
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities

aarch64

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

I have experience profiling Java applications for both CPU and memory usage with tools like YourKit, JProfiler and Async Profiler, and I think I've found all the obvious performance-related problems in our application. Using Apache JMeter (5.3.0) I've created a test plan that simulates a realistic load: 9000 virtual users navigate the application, with think time, ramp-up time, etc. The JMeter reports for both machines look very similar: after all the tweaking and tuning I was able to reach 1200 requests per second with this JMeter plan. If I increase the number of virtual users or decrease the think time, JMeter starts reporting errors, mostly related to timeouts (both connect and read timeouts).
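As a rough sanity check on the JMeter numbers, Little's law for a closed-loop test (users = throughput × (response time + think time)) gives the average time each virtual user spends per request cycle. This is only a sketch: it assumes the test is in steady state, and the 7.0 s / 0.5 s split in the second call is a made-up illustration, not a value from the test plan.

```python
def mean_cycle_time_s(virtual_users: int, throughput_rps: float) -> float:
    """Average seconds per request cycle (think time + response time) per user.

    Little's law for a closed system: N = X * (R + Z), so R + Z = N / X.
    """
    return virtual_users / throughput_rps


def max_throughput_rps(virtual_users: int, think_time_s: float,
                       response_time_s: float) -> float:
    """Throughput a closed-loop test can generate for a given think/response time."""
    return virtual_users / (think_time_s + response_time_s)


# Figures from the question: 9000 virtual users, about 1200 requests per second.
cycle = mean_cycle_time_s(9000, 1200)
print(f"mean cycle time per user: {cycle} s")

# Hypothetical split of that cycle into think time and response time.
print(max_throughput_rps(9000, think_time_s=7.0, response_time_s=0.5), "req/s")
```

With these figures each user issues a request roughly every 7.5 seconds, so the 1200 req/s ceiling is largely set by the think time in the plan rather than by either server.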

So I decided to use wrk. With it, the client machine (the one running the load test client) uses far fewer resources, and I was able to get much better throughput:

around 40000 req/s when executing against the x86_64 machine
around 20000 req/s when executing against the aarch64 machine
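Normalising these figures by the core counts from the lscpu output above gives a rough per-core view. This is a sketch only: the throughput values are the approximate ones quoted here, and dividing by cores assumes the load is spread evenly, which may not hold.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    logical_cpus: int      # "CPU(s)" in lscpu
    threads_per_core: int  # "Thread(s) per core" in lscpu
    throughput_rps: float  # approximate wrk result

    @property
    def physical_cores(self) -> int:
        return self.logical_cpus // self.threads_per_core

    @property
    def rps_per_physical_core(self) -> float:
        return self.throughput_rps / self.physical_cores

    @property
    def rps_per_logical_cpu(self) -> float:
        return self.throughput_rps / self.logical_cpus


x86 = Machine("x86_64", logical_cpus=8, threads_per_core=2, throughput_rps=40000)
arm = Machine("aarch64", logical_cpus=8, threads_per_core=1, throughput_rps=20000)

for m in (x86, arm):
    print(f"{m.name}: {m.physical_cores} cores, "
          f"{m.rps_per_physical_core:.0f} req/s per core, "
          f"{m.rps_per_logical_cpu:.0f} req/s per logical CPU")
```

Note that the x86_64 VM exposes 4 physical cores with 2 hardware threads each, while the aarch64 VM exposes 8 single-threaded cores, so the per-physical-core gap is larger than the overall 2x.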

Now, my question is: how can I find out what makes the x86_64 machine twice as fast as the aarch64 one? What kind of tools would you use to find where the difference is?

I've tried the perf tool, but so far I can't really grasp how to read and interpret its records.

One thing I know for sure is that it is not the network bandwidth, because with iperf I can get 5.48 Gbits/sec, while wrk reaches at most 220 MBit/sec (according to nload). If I am not wrong, this is around 25 times below the maximum throughput.
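A quick check of that ratio, assuming both tools report bits per second (Gbit/s for iperf, Mbit/s for nload) and using decimal units:

```python
def to_mbit_per_s(value: float, unit: str) -> float:
    """Convert a rate in 'Mbit' or 'Gbit' per second to Mbit/s (decimal units)."""
    factors = {"Mbit": 1.0, "Gbit": 1000.0}
    return value * factors[unit]


link_capacity = to_mbit_per_s(5.48, "Gbit")  # measured with iperf
wrk_traffic = to_mbit_per_s(220, "Mbit")     # peak seen in nload during the wrk run

headroom = link_capacity / wrk_traffic
utilisation = wrk_traffic / link_capacity

print(f"headroom: {headroom:.1f}x, link utilisation: {utilisation:.1%}")
```

On these figures the wrk run uses only a few percent of the measured link capacity, which supports the conclusion that bandwidth is not the limit.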

All machines run on Ubuntu 18.04.4

score -4 · Answer 1 · answered Jun 22 '20 at 16:33

Looking into your own CPU information:

x86_64 - BogoMIPS: 6000.00
aarch64 - BogoMIPS: 200.00

And as per Wikipedia:

BogoMips (from "bogus" and MIPS) is a crude measurement of CPU speed made by the Linux kernel when it boots to calibrate an internal busy-loop. An often-quoted definition of the term is "the number of million times per second a processor can do absolutely nothing".

It's related to the CPU frequency, so my expectation is that the ARM processor's actual frequency is much lower. You can use the sar tool or the JMeter PerfMon Plugin to check both systems' metrics (CPU, RAM, swap, etc.); this way you will be able to tell for sure what the bottleneck is on the ARM system.
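For a quick look at CPU utilisation without extra tooling, the aggregate `cpu` line of `/proc/stat` can be sampled twice and compared. The sketch below is an illustration only, not how sar or PerfMon are implemented; the function names are made up, and it assumes a Linux kernel that reports the usual user, nice, system, idle, iowait, irq, softirq and steal fields.

```python
import time
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (busy_jiffies, total_jiffies) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # user nice system idle iowait irq softirq steal; the guest fields are
    # skipped because guest time is already counted in user/nice.
    values = [int(v) for v in fields[1:9]]
    values += [0] * (8 - len(values))  # pad on kernels with fewer fields
    idle = values[3] + values[4]       # idle + iowait
    total = sum(values)
    return total - idle, total


def busy_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent busy between two samples of the 'cpu' line."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Read /proc/stat twice, interval_s apart, and return the busy fraction."""
    with open(path) as f:
        before = f.readline()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.readline()
    return busy_fraction(before, after)


# On a Linux host, during the load test:
#     print(f"CPU busy: {sample(5.0):.0%}")
```

Running this on both servers during the wrk run would show whether either machine is actually CPU-bound at its peak throughput.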

With regard to the tool selection, JMeter is "heavier" than wrk, but it is also more powerful thanks to its support for cookies, caching, and embedded resources (parsing the response and automatically downloading images, scripts, styles, etc.).

answered Jun 22 '20 at 16:33

Dmitri T


2

BogoMIPS ratios across totally different architectures almost certainly tell us nothing about actual clock-speed ratios, or much of anything else. And/or the ARM one is probably artificially low for some reason related to running in a VM, or scaled differently. (That ancient Wikipedia definition can't be accurate; 6 billion iterations per second for an empty loop isn't correct for the x86 CPU; it can't turbo to 6GHz. And 200 million iters/sec is implausibly low for an AArch64 you could use in the cloud, unless that's its very low idle clock speed.) – Peter Cordes Jun 22 '20 at 19:48
1

I expect that an AArch64 cloud VM instance is probably running on hardware that clocks up to somewhere around 1.5 to 2.5 GHz under full CPU load, while the Xeon Gold 6266C has [a sustained frequency of 3GHz, max all-core turbo of 3.2GHz](https://en.wikipedia.org/wiki/List_of_Intel_Cascade_Lake-based_Xeon_microprocessors). The Xeon might have more out-of-order execution resources and/or larger or faster caches, too. It's not plausible that an 8-core AArch64 at 200MHz could run at half the speed of an 8-core Cascade Lake at 6GHz, as your naive interpretation of BogoMIPS would imply. – Peter Cordes Jun 22 '20 at 19:52
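To put numbers on that objection, using only figures from the question: if BogoMIPS tracked real speed, the two ratios below should be in the same ballpark. A minimal sketch, with the throughput values being the approximate wrk results.

```python
def ratio(x86_value: float, aarch64_value: float) -> float:
    """How many times larger the x86_64 figure is than the aarch64 one."""
    return x86_value / aarch64_value


bogomips_ratio = ratio(6000.00, 200.00)  # from the lscpu output
throughput_ratio = ratio(40000, 20000)   # approximate wrk req/s

print(f"BogoMIPS ratio:   {bogomips_ratio:.0f}x")
print(f"throughput ratio: {throughput_ratio:.0f}x")
print(f"mismatch factor:  {bogomips_ratio / throughput_ratio:.0f}x")
```

The BogoMIPS figures differ by a factor of 30 while the measured throughput differs by a factor of 2, so BogoMIPS cannot be read as a proxy for the speed difference here.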
