
I'm running a very "simple" JMH test with the following configuration:

    @Fork(value = 1, jvmArgs = {
            "--illegal-access=permit", "-Xms10G",
            "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
            "-XX:ActiveProcessorCount=7", "-XX:+UseNUMA",
            "-XX:DisableIntrinsic=_currentTimeMillis,_nanoTime",
            "-Xmx10G", "-XX:+UnlockExperimentalVMOptions",
            "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10", "-XX:+UseZGC",
            "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"
    })
    @Benchmark
    public String generateRandom() {
        return UUID.randomUUID().toString();
    }

Maybe it's not very simple, because it uses random numbers, but the same issue shows up in any other Java test.

On my home desktop

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz 12 Threads (hyperthreading enabled ), 64 GB Ram, "Ubuntu" VERSION="20.04.2 LTS (Focal Fossa)"
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Performance with 7 threads:

Benchmark                                            Mode  Cnt        Score       Error   Units
RulesBenchmark.generateRandom                       thrpt    5  1312295.357 ± 27853.707   ops/s

Flame graph (async-profiler) with 7 threads at home: [flame graph image]

I have an issue on Oracle Linux

Linux  5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyperthreading disabled; the result is the same when it is enabled and there are 112 CPU threads), 1 TB RAM, NAME="Oracle Linux Server" VERSION="8.4"

Here I get half the performance, even when increasing the number of threads.

With 1 thread, I get very good performance:

Benchmark                                            Mode  Cnt        Score      Error   Units
RulesBenchmark.generateRandom                       thrpt    5  2377471.113 ± 8049.532   ops/s

Flame graph (async-profiler) with 1 thread: [flame graph image]

But with 7 threads:

Benchmark                                            Mode  Cnt       Score       Error   Units
RulesBenchmark.generateRandom                       thrpt    5  688612.296 ± 70895.058   ops/s

Flame graph (async-profiler) with 7 threads: [flame graph image]

Maybe it's a NUMA issue, because there are 2 sockets and the system is configured with only 1 NUMA node. Output of numactl --hardware:

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1030835 MB
node 0 free: 1011029 MB
node distances:
node   0 
  0:  10 

But after disabling some cpu threads using:

for i in {12..55}
do
  # take CPU $i offline
  echo '0' | sudo tee /sys/devices/system/cpu/cpu$i/online
done

the performance improved only slightly.

This is just a very "simple" test. On a complex test with real code, it's even worse: it spends a lot of time in .annobin___pthread_cond_signal.start

I also deployed a Vagrant image with the same version of Oracle Linux and the same kernel version on my home desktop and ran it with 10 CPU threads, and the performance was nearly the same (~1M ops/s) as on my desktop. So it's not about the OS or the kernel, but about some configuration.

I tested with several JDK versions and vendors (JDK 11 and above). There is a very small performance difference when using OpenJDK 11 from the YUM distribution, but it's not significant.

Can you suggest some advice? Thanks in advance.

Jon Heller

1 Answer


In essence, your benchmark tests the throughput of SecureRandom. The default implementation is synchronized (more precisely, the default implementation mixes the input from /dev/urandom and the above provider).

The paradox is, more threads result in more contention, and thus lower overall performance, as the main part of the algorithm is under a global lock anyway. Async-profiler indeed shows that the bottleneck is the synchronization on a Java monitor: __lll_unlock_wake, __pthread_cond_wait, __pthread_cond_signal - all come from that synchronization.
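To see this scaling behaviour without a full JMH setup, a crude harness along the following lines can be used. It is only a sketch: the class name UuidScaling and its run method are hypothetical, the timing is far less rigorous than JMH (no warm-up, and System.nanoTime() is called on every iteration), and the numbers it prints will differ from machine to machine.

```java
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper, not part of the original benchmark.
final class UuidScaling {

    /** Calls UUID.randomUUID() on `threads` threads for about `millis` ms; returns the total call count. */
    static long run(int threads, long millis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        LongAdder total = new LongAdder();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long n = 0;
                do {
                    UUID.randomUUID();   // all threads draw from the same shared generator
                    n++;
                } while (System.nanoTime() - deadline < 0);
                total.add(n);            // publish the per-thread count once, at the end
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(millis + 10_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.sum();
    }

    public static void main(String[] args) {
        for (int threads : new int[] {1, 2, 4, 7}) {
            // With a 1000 ms window the total count approximates ops/s.
            System.out.printf("%d thread(s): ~%d ops/s%n", threads, run(threads, 1_000));
        }
    }
}
```

If the total does not grow, or even shrinks, as the thread count goes up, that is consistent with the work being serialized behind a single lock.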

The contention overhead definitely depends on the hardware, the firmware, and the OS configuration. Instead of trying to reduce this overhead (which can be hard: some day yet another security patch may arrive that makes syscalls 2x slower, for example), I'd suggest getting rid of the contention in the first place.
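As one illustration of code that never touches the shared lock, version 4 UUIDs can be assembled from ThreadLocalRandom. This is a different technique from replacing the SecureRandom provider, and only a minimal sketch: the class name FastUuid is hypothetical, and ThreadLocalRandom is not cryptographically secure, so this assumes the identifiers do not need to be unpredictable.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: NOT a drop-in replacement where unpredictability matters.
final class FastUuid {

    /** Builds a random (version 4, IETF variant) UUID from the calling thread's own generator. */
    static UUID next() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        // Clear the 4 version bits of the high word, then set them to 0100 (version 4).
        long msb = (rnd.nextLong() & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Clear the top 2 bits of the low word, then set them to 10 (IETF variant).
        long lsb = (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = next();
        System.out.println(id + " version=" + id.version() + " variant=" + id.variant());
    }
}
```

Each thread uses its own generator state, so there is no monitor to contend on. Whether the weaker randomness is acceptable depends on the same throughput/scalability/security trade-off that applies to any SecureRandom replacement.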

The contention can be eliminated by installing a different, non-blocking SecureRandom provider, as shown in this answer. I won't recommend a particular SecureRandomSpi, as it depends on your specific requirements (throughput/scalability/security). I will just mention that an implementation can be based on

apangin
  • Andrei, thank you so much for taking the time to look at my question. As I said, it's only a "simple test". I have "securerandom.source=file:/dev/urandom". If you look at https://stackoverflow.com/questions/67845210/drools-performance, the same issue is there. In any test I run on this server, I get nearly 3x lower performance – Dimitri Gamkrelidze Jul 02 '21 at 21:57
  • @DimitriGamkrelidze Unlike [the linked question](https://stackoverflow.com/questions/67845210/drools-performance), this one looked specific enough - I hope I answered about the specific problem of SecureRandom scalability. If your original problem isn't about RNG, this means, your [minimal example](https://stackoverflow.com/help/minimal-reproducible-example) isn't minimal enough. Start with "2+2" test. Is it also really 3x slower? – apangin Jul 03 '21 at 00:40
  • You are right, this example is not "simple" enough to measure performance. As you saw, I wrote that "Maybe it's not very simple, because it uses random, but the same issue is on any other tests with java". Thank you for the answer. Here is a gist https://gist.github.com/ditogam/eab68c5ea69f49e6decd6ed85952df9b and you'll see that with AtomicInteger there is no problem, while with synchronized and ReentrantLock blocks the results differ significantly – Dimitri Gamkrelidze Jul 03 '21 at 15:30
  • Anyway, thank you for taking the time to write a detailed answer; you answered based on what you saw. It's my fault that I provided a bad example for the question :) – Dimitri Gamkrelidze Jul 03 '21 at 15:34
  • @DimitriGamkrelidze In fact, one good conclusion can be made from your benchmark: *atomics scale much better than locks*. There is not much sense in running an inherently single-threaded test case in 7 threads, other than to get proof that lock contention sucks. You probably wanted to know why it sucks more on a server than on a home laptop, but in my opinion, this does not matter. This can be CPU frequency, memory access costs, syscall auditing or other random stuff - doesn't matter. Contention is bad by definition, so try to reduce contention, or better, get rid of it entirely. – apangin Jul 04 '21 at 14:59
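
The "atomics versus locks" comparison from the comments can be sketched roughly as follows. This is not the code from the linked gist; the class CounterContention and its members are hypothetical names, and the timing is a crude wall-clock measurement rather than a proper JMH benchmark.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not taken from the gist.
final class CounterContention {

    interface Counter {
        void increment();
        int get();
    }

    /** Lock-free counter: each increment is a single atomic read-modify-write. */
    static final class AtomicCounter implements Counter {
        private final AtomicInteger value = new AtomicInteger();
        public void increment() { value.incrementAndGet(); }
        public int get() { return value.get(); }
    }

    /** Monitor-based counter: every increment has to acquire the same lock. */
    static final class LockedCounter implements Counter {
        private int value;
        public synchronized void increment() { value++; }
        public synchronized int get() { return value; }
    }

    /** Runs `threads` threads that each increment `perThread` times; returns elapsed nanoseconds. */
    static long hammer(Counter counter, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int threads = 7;
        int perThread = 1_000_000;
        long atomicNanos = hammer(new AtomicCounter(), threads, perThread);
        long lockedNanos = hammer(new LockedCounter(), threads, perThread);
        System.out.printf("atomic: %d ms, synchronized: %d ms%n",
                atomicNanos / 1_000_000, lockedNanos / 1_000_000);
    }
}
```

Both counters end up with the same total; only the time taken to get there differs, and by how much depends on the machine.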