
We are seeing a behavior where JVM performance decreases when the load is light. Specifically, over multiple runs in a test environment, we are noticing that latency worsens by around 100% when the rate of order messages pumped into the system is reduced. Some background on the issue is below; I would appreciate any help with this.

Simplistically, the demo Java trading application being investigated can be thought of as having three important threads: an order receiver thread, a processor thread, and an exchange transmitter thread.

The order receiver thread receives an order and puts it on a processor queue. The processor thread picks it up from the processor queue, does some basic processing, and puts it on an exchange queue. The exchange transmitter thread picks it up from the exchange queue and sends the order to the exchange.
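The hand-off described above can be sketched with standard `java.util.concurrent` queues. The class and method names here are illustrative, not taken from the actual application:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OrderPipeline {
    private final BlockingQueue<String> processorQ = new ArrayBlockingQueue<>(1024);
    private final BlockingQueue<String> exchangeQ  = new ArrayBlockingQueue<>(1024);

    // Order receiver thread: receives an order and puts it on the processor queue.
    public void receive(String order) throws InterruptedException {
        processorQ.put(order);
    }

    // Processor thread: picks orders off the processor queue, does some basic
    // processing, and hands them to the exchange queue.
    public void runProcessor() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String order = processorQ.take();
            exchangeQ.put(order.toUpperCase()); // stand-in for "basic processing"
        }
    }

    // Exchange transmitter thread: drains the exchange queue and sends the order
    // to the exchange (a real transmitter would write to a socket here).
    public String transmitOne() throws InterruptedException {
        return exchangeQ.take();
    }
}
```

Note that with blocking queues like these, an idle consumer thread parks in the scheduler between messages, which is exactly the situation the question is about.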

The latency from order receipt to the order going out to the exchange worsens by 100% when the rate of orders pumped into the system is reduced from a higher number to a low one.

Solutions tried:

  1. Warming up the critical code path in the JVM by sending a high message rate and priming the system before reducing the message rate: does not solve the issue.
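For context, the priming in point 1 was along these lines; a minimal sketch, where `processOrder` is a hypothetical stand-in for the application's real critical path:

```java
// Illustrative JVM warm-up: drive the critical path with synthetic orders so the
// JIT compiles the hot methods before the message rate is reduced.
// processOrder is a hypothetical stand-in for the application's processing step.
public class Warmup {
    static int processOrder(int id) {
        return Integer.toHexString(id).hashCode(); // stand-in for real work
    }

    public static void main(String[] args) {
        long checksum = 0;
        // HotSpot's default server-compiler invocation threshold is on the
        // order of 10,000 calls, so loop well past it.
        for (int i = 0; i < 50_000; i++) {
            checksum += processOrder(i);
        }
        System.out.println("warm-up done, checksum=" + checksum);
    }
}
```

As noted above, this kind of priming addresses JIT compilation only; it does not help if the slowdown comes from the OS (scheduling, frequency scaling) rather than from interpreted code.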

  2. Profiling the application: the profiler shows hotspots in the code where a 10–15% improvement might be had by improving the implementation, but nothing in the range of the 100% improvement just obtained by increasing the message rate.

Does anyone have any insights or suggestions on this? Could it have to do with scheduling jitter on the threads?

Could it be that under the low message rate the threads are being switched off their cores?

Two posts I think may be related are below; however, our symptoms are a bit different:

is the jvm faster under load?

Why does the JVM require warmup?

Rv_menon
  • Go through your design again; how the messages are being handed over from your receiver to processor and on to transmitter plays an important role in such situations. – Nitin Dandriyal Oct 03 '19 at 03:48

1 Answer


Consistent latency under low/medium load requires specific Linux tuning.

Below are a few points from my old checklist that are relevant for components with millisecond latency requirements.

  • configure CPU cores to always run at maximum frequency (here are the docs for Red Hat)
  • configure dedicated CPU cores for your critical application threads
    • use isolcpus to exclude dedicated cores from the scheduler
    • use taskset to bind critical threads to specific cores
  • configure your service to run on a single NUMA node (with numactl)

The Linux scheduler and CPU power saving are key contributors to high latency variance under low/medium load.

By default, a CPU core reduces its frequency when inactive; as a consequence, your next request is processed more slowly on the downclocked core.

The CPU cache is a key performance asset: if your critical thread is scheduled on different cores, it loses its cached data. Also, other threads scheduled on the same core evict cache lines, further increasing the latency of critical code.

Under heavy load these factors matter less (frequency is maxed out and threads are ~100% busy, tending to stick to specific cores).

Under low/medium load, though, these factors negatively affect both average latency and high percentiles (the 99th percentile may be an order of magnitude worse than in the heavy-load case).

For high-throughput applications (above 100k requests/sec), advanced inter-thread communication approaches (e.g. the LMAX Disruptor) are also useful.
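To illustrate the idea (this is not the Disruptor's actual API, just a much-simplified sketch of its core technique): replace the lock-based queue with pre-allocated slots plus sequence counters, and let the consumer busy-spin instead of parking, so it stays hot on its core between messages.

```java
import java.util.concurrent.atomic.AtomicLong;

// Much-simplified single-producer/single-consumer ring buffer in the spirit of
// the LMAX Disruptor: pre-allocated slots plus sequence counters instead of a
// lock-based queue. Capacity must be a power of two.
public class SpscRing {
    private final long[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to write
    private final AtomicLong tail = new AtomicLong(); // next slot to read

    public SpscRing(int capacityPow2) {
        slots = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Producer: spin until a slot is free, then publish.
    public void offer(long value) {
        long h = head.get();
        while (h - tail.get() >= slots.length) { Thread.onSpinWait(); }
        slots[(int) (h & mask)] = value;
        head.lazySet(h + 1); // release-publish after the slot write
    }

    // Consumer: busy-spin until a value is available. Spinning keeps the
    // consumer thread hot on its core instead of sleeping in the scheduler,
    // at the cost of burning a core.
    public long poll() {
        long t = tail.get();
        while (head.get() <= t) { Thread.onSpinWait(); }
        long v = slots[(int) (t & mask)];
        tail.lazySet(t + 1);
        return v;
    }
}
```

Busy-spinning consumers pair naturally with the core isolation and pinning described above: a spinning thread on a dedicated, max-frequency core avoids both the downclocking and the cache-migration penalties.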

Alexey Ragozin
  • The above detailed feedback was extremely helpful. We isolated CPUs as you recommended in our repeatable test environment. We did runs with the entire Java process on a specific NUMA node and also gave it a taskset of specific cores. Unfortunately, these did not improve the latency under lighter load. I had 2 questions: 1. How is it possible to bind only a certain thread within a Java process to a specific core using taskset? 2. Any inputs on how to configure a CPU core to always run at maximum frequency? – Rv_menon Oct 04 '19 at 14:34
  • @Rv_menon could you share numbers such observed latency and request rate? – Alexey Ragozin Oct 04 '19 at 14:36
  • Also, based on your suggestions, we are now looking at using OpenHFT Thread Affinity (https://github.com/OpenHFT/Java-Thread-Affinity) to bind the critical processor thread to a core. – Rv_menon Oct 04 '19 at 14:37
  • @Rv_menon you can use taskset with the Java thread id (e.g. parsed from a thread dump), though github.com/OpenHFT/Java-Thread-Affinity is doing just that – Alexey Ragozin Oct 04 '19 at 14:54
  • CPU frequency becomes a bit messy across various distros. Red Hat version: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/power_management_guide/cpufreq_governors – Alexey Ragozin Oct 04 '19 at 14:57
  • Alexey Ragozin: Your analysis and inputs were spot on. On our demo application, by switching from Red Hat Enterprise Linux 6 to 7, we were able to reduce the latency from around 400 microseconds to around 200 microseconds under light load. Another thing that changed is that we moved the application from an Intel Haswell to an Intel Broadwell machine along with the move from RHEL 6 to 7. Does it make sense to you why we are seeing such a large improvement in performance with the same code and the same runs after this move? Thanks again for all your valuable inputs. – Rv_menon Oct 11 '19 at 20:36