12

I have run a simple test to measure the AES-GCM performance in Java 9, by encrypting byte buffers in a loop. The results were somewhat confusing. The native (hardware) acceleration seems to work - but not always. More specifically,

  1. When encrypting 1MB buffers in a loop, the speed is ~60 MB/sec for the first ~50 seconds. Then it jumps to 1100 MB/sec and stays there. Does the JVM decide to activate the hardware acceleration after 50 seconds (or 3GB of data)? Can this be configured? Where can I read about the new AES-GCM implementation (besides here)?
  2. When encrypting 100MB buffers, the hardware acceleration doesn't kick in at all. The speed is a flat 60 MB/sec.

My test code looks like this:

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

static final int GCM_TAG_LENGTH = 16; // tag length in bytes

int plen = 1024*1024;
byte[] input = new byte[plen];
for (int i=0; i < input.length; i++) { input[i] = (byte)i;}
byte[] nonce = new byte[12];
...
// Uses SunJCE provider
Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
byte[] key_code = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
SecretKey key = new SecretKeySpec(key_code, "AES");
SecureRandom random = new SecureRandom();

long total = 0;
while (true) {
  random.nextBytes(nonce);
  GCMParameterSpec spec = new GCMParameterSpec(GCM_TAG_LENGTH * 8, nonce);
  cipher.init(Cipher.ENCRYPT_MODE, key, spec);
  byte[] cipherText = cipher.doFinal(input);
  total += plen;
  // print delta_total/delta_time, once in a while
}

Feb 2019 update: HotSpot has been modified to address this issue. The fix ships in Java 13 and has also been backported to Java 11 and 12.

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8201633, https://hg.openjdk.java.net/jdk/jdk/rev/f35a8aaabcb9

July 16, 2019 update: The newly released Java version (Java 11.0.4) fixes this problem.

gidon
  • forgot to mention - running on an Ubuntu 16 box, single CPU (Intel Skylake) with 8 cores. – gidon Feb 21 '18 at 12:01
  • 5
    I think you should read about [How to write a correct micro-benchmark in Java](https://stackoverflow.com/a/513259/452775) and apply the techniques that are described there. Then you will probably have a better idea about what you are measuring. You should, for example, include a warm-up phase in your benchmark. The things you are seeing could be because the JIT compiler kicks in and optimises your code after 50 sec. – Lii Feb 21 '18 at 12:10
  • It's not about benchmarking techniques. Do you know that this problem is due to the JIT kicking in after 50 sec / 3GB? If yes, that's useful information for me - it means that a process that starts, encrypts say 2GB of data, and ends will never run at h/w speed. Seems harsh. Maybe it's configurable, or there is another explanation. Also, why does it never kick in with 100MB files? Any info on this effect, JIT-caused or not, will be appreciated. – gidon Feb 21 '18 at 14:36
  • 4
    It seems the number of invocations matters for the optimization trigger and, of course, processing lots of small buffers implies more invocations than processing a few big buffers, considering the same amount of time. Mind the possibility to process a big buffer by repeatedly invoking `update` for a small portion of it and finally invoking `doFinal` for the last chunk… – Holger Feb 21 '18 at 17:36
  • 1
    @Lii You're right, warmup would most probably help, but this is no microbenchmark, it takes quite some time and should work better. – maaartinus Feb 21 '18 at 18:00
  • @maaartinus I just wonder if this has to do with intrinsics here; probably 50 sec is the time when the minimum number of required method invocations is hit, so that a certain method is intrinsified. I also wonder if bigger buffers would mean going to a different branch, where no intrinsics are possible, but I am speculating here – Eugene Feb 21 '18 at 20:11
  • Your question may not be about benchmarking techniques, but without proper benchmarking not much can be inferred. So someone has to write them anyway (or may have already written them during the development of that very feature!). – the8472 Feb 21 '18 at 23:21
  • 3
    @Eugene it’s not about taking different branches. I tried it with different buffer sizes and also varying buffer sizes. You can warm up the code in a second by executing it often enough with a tiny buffer to get the optimization, followed by calling the same code with a huge buffer, still benefiting from the already applied optimization. This indicates that it is merely the number of invocations that matters. When you refactor the code to always process an equally small part of the buffer via repeated `update` operations followed by `doFinal`, the total buffer size becomes irrelevant… – Holger Feb 22 '18 at 08:33
  • @Holger then this makes little sense, this has to be hardcoded somewhere in the AES code (or a flag that we don't know about), still very weird – Eugene Feb 22 '18 at 08:39
  • @Holger Thank you! That was it - splitting encryption into multiple updates solves this. Roughly, it takes 10,000 operations to warm the code up. I'm posting additional details below. – gidon Feb 22 '18 at 09:33
  • @Holger so it is about warming up then, 10_000 (roughly) being the limit when some method hits C2 compiler, when a certain method is replaced with an intrinsic call, where hardware acceleration would kick in. – Eugene Feb 22 '18 at 09:35
  • @Holger, Eugene: Correcting my numbers above - with additional experiments, it looks like the optimization starts earlier, after approximately 600 operations (~40 millisec with 4KB chunks, ~160 millisec with 16KB chunks). – gidon Feb 22 '18 at 10:08
  • @Holger Bad news, though. This works only for encryption. The decryption warms up only with doFinal operations, not with updates. Please let me know if you see a different picture in your environment. – gidon Feb 22 '18 at 13:54
  • 1
    @gg123 This should IMHO be reported as a bug. The encryption of huge blocks should be split automatically, and for decryption some solution should be found. – maaartinus Feb 23 '18 at 16:44
  • 1
    @maaartinus You are right. I waited for Java 10 to see if this had been addressed, but the results are basically the same (with some workaround for the decryption - complex and unreliable..). Submitting a bug. – gidon Apr 12 '18 at 11:40
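To illustrate the warm-up effect discussed in these comments, here is a minimal sketch of the experiment: many invocations on a tiny buffer, followed by one large-buffer encryption. The class name, the 20,000-iteration warm-up count, the buffer sizes, and the 16-byte tag length are illustrative assumptions, not values from the discussion:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

public class GcmWarmup {
    static final int GCM_TAG_LENGTH = 16; // tag length in bytes (assumption)

    // One encryption with a fresh nonce (SunJCE rejects key+IV reuse);
    // returns the time spent in doFinal, in nanoseconds.
    static long encryptOnce(Cipher cipher, SecretKey key, SecureRandom rnd, byte[] buf) throws Exception {
        byte[] nonce = new byte[12];
        rnd.nextBytes(nonce);
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_LENGTH * 8, nonce));
        long t0 = System.nanoTime();
        cipher.doFinal(buf);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = new SecretKeySpec(new byte[16], "AES");
        SecureRandom rnd = new SecureRandom();
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");

        // Warm-up: many invocations on a tiny buffer reach the compilation
        // threshold in well under a second.
        byte[] tiny = new byte[1024];
        for (int i = 0; i < 20_000; i++) encryptOnce(cipher, key, rnd, tiny);

        // A large buffer encrypted afterwards benefits from the already
        // compiled (intrinsified) code path.
        byte[] big = new byte[16 * 1024 * 1024];
        long ns = encryptOnce(cipher, key, rnd, big);
        System.out.printf("16 MB after warm-up: %.0f MB/s%n", 16.0 / (ns / 1e9));
    }
}
```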

4 Answers

10

Thanks @Holger for pointing in the right direction. Preceding cipher.doFinal with multiple cipher.update calls triggers the hardware acceleration almost immediately.

Based on this reference, GCM Analysis, I'm using 4KB chunks in each update call. Now both 1MB and 100MB buffers are encrypted at 1100 MB/sec (after a few dozen milliseconds).

The solution is to replace

byte[] cipherText = cipher.doFinal(input);

with

int clen = plen + GCM_TAG_LENGTH;
byte[] cipherText = new byte[clen];

int chunkLen = 4 * 1024;
int left = plen;
int inputOffset = 0;
int outputOffset = 0;

while (left > chunkLen) {
  int written = cipher.update(input, inputOffset, chunkLen, cipherText, outputOffset);
  inputOffset += chunkLen;
  outputOffset += written;
  left -= chunkLen;
}

cipher.doFinal(input, inputOffset, left, cipherText, outputOffset);
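For reference, a self-contained sketch of this workaround, verifying that the chunked update calls produce exactly the same ciphertext as a single doFinal. The class name, the helper name, and the 16-byte tag length are assumptions for illustration:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class ChunkedGcm {
    static final int GCM_TAG_LENGTH = 16; // tag length in bytes (assumption)
    static final int CHUNK_LEN = 4 * 1024; // 4KB chunks, as in the answer

    // Encrypts the input with repeated update() calls, so the many small
    // invocations trigger the intrinsic quickly; doFinal() writes the tag.
    static byte[] encryptChunked(SecretKey key, byte[] nonce, byte[] input) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_LENGTH * 8, nonce));
        byte[] cipherText = new byte[input.length + GCM_TAG_LENGTH];
        int left = input.length, inOff = 0, outOff = 0;
        while (left > CHUNK_LEN) {
            outOff += cipher.update(input, inOff, CHUNK_LEN, cipherText, outOff);
            inOff += CHUNK_LEN;
            left -= CHUNK_LEN;
        }
        cipher.doFinal(input, inOff, left, cipherText, outOff);
        return cipherText;
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = new SecretKeySpec(new byte[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}, "AES");
        byte[] nonce = new byte[12];
        new SecureRandom().nextBytes(nonce);
        byte[] input = new byte[1024 * 1024];
        for (int i = 0; i < input.length; i++) input[i] = (byte) i;

        // Chunked and single-shot encryption must be byte-identical;
        // a fresh Cipher instance avoids SunJCE's key+IV reuse check.
        byte[] chunked = encryptChunked(key, nonce, input);
        Cipher oneShot = Cipher.getInstance("AES/GCM/NoPadding");
        oneShot.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_LENGTH * 8, nonce));
        byte[] single = oneShot.doFinal(input);

        if (!Arrays.equals(chunked, single)) throw new AssertionError("ciphertext mismatch");
        System.out.println("ok");
    }
}
```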
Prags
gidon
3

A couple of updates on this issue.

  1. Java 10, released in late March, has the same problem, which can be bypassed with the same workaround - for data encryption only.

  2. The workaround basically doesn't work for data decryption - in both Java 9 and Java 10.

I've submitted a bug report to the Java platform. It has been evaluated and published as JDK-8201633.

gidon
  • 2
    The Apache Parquet team started a discussion on the OpenJDK lists: http://mail.openjdk.java.net/pipermail/security-dev/2018-November/018745.html – eckes Nov 14 '18 at 21:16
2

This problem is fixed in Java 13. The fix is also backported to Java 11 and 12.

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8201633, https://hg.openjdk.java.net/jdk/jdk/rev/f35a8aaabcb9

gidon
0

The Java version released on July 16, 2019 (Java 11.0.4) fixes this problem.

gidon