I have a Java Servlet based application running on Apache Tomcat on two different machines with similar hardware (RAM, SSD disk, network interface and bandwidth) but different CPU architectures:
- x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
- aarch64
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
I have experience profiling Java applications both for CPU and memory usage with tools like Yourkit, JProfiler and Async Profiler. And I think I've found all the obvious performance related problems in our application. Using Apache JMeter (5.3.0) I've created a test plan that simulates real case loading: 9000 virtual users navigate the application, with think time, ramp up time, etc. The JMeter reports for both machines look very similar - after all the tweaking and tuning I was able to reach 1200 requests per second with this JMeter plan. If I increase the number of virtual users or decrease the think time then JMeter starts reporting errors mostly related to timeouts (both connect and read timeouts).
So I've decided to use wrk. With it the client machine (the machine where the load test client runs at) uses much less resources and I was able to get much better throughput:
- around 40000 req/s when executing against the x86_64 machine
- around 20000 req/s when executing against the aarch64 machine
Now, my question is: How to find out what makes the x86_64 machine twice more performant than the aarch64 one ? What kind of tools would you use to find where is the difference ?
I've tried with perf tool but so far I cannot really grasp how to read and interpret its records.
One thing I know for sure is that it is not the network bandwidth because with iperf I can get 5.48 Gbits/sec, while wrk
reaches at most 220 MBit/sec (according to nload). If I am not wrong this is around 5 times below the maximum throughput.
All machines run on Ubuntu 18.04.4