What is the fastest way to SHA-256 encode many short String values in Java (on an Intel CPU)?

Question

This question is loosely related to these two questions, with two differences: 1) I want to know how to hook specific Intel instructions from the JVM (ideally via an existing library), and 2) I don't care about one large file, but about millions of short (< 50 characters) String and Number objects.

I noticed that Intel provides native extensions (https://software.intel.com/en-us/articles/intel-sha-extensions) for creating SHA256 hashes. Is there any existing library in Java that can hook these native extensions? Is there a JVM implementation that natively hooks these extensions?

Is there a different implementation I should choose for millions of small String and Number values over a single giant file?

As a test, I tried five SHA-256 implementations: Java built-in, Groovy built-in, Apache Commons, Guava, and Bouncy Castle. Only Apache Commons and Guava came close to or exceeded 1 million hashes/sec on my Intel i5 hardware.

>groovy hash_comp.groovy
Hashing 1000000 iterations of SHA-256
time java: 2968         336927.2237196765 hashes/sec
time groovy: 2451       407996.7360261118 hashes/sec
time apache: 1025       975609.7560975610 hashes/sec
time guava: 901         1109877.9134295228 hashes/sec
time bouncy: 1969        507872.0162519045 hashes/sec

>groovy hash_comp.groovy
Hashing 1000000 iterations of SHA-256
time java: 2688         372023.8095238095 hashes/sec
time groovy: 1948       513347.0225872690 hashes/sec
time apache: 867        1153402.5374855825 hashes/sec
time guava: 953         1049317.9433368311 hashes/sec
time bouncy: 1890       529100.5291005291 hashes/sec

When I ran 10 times in a row, Apache Commons hashing was the consistent winner when hashing 1 million strings (it won 9/10 times). My test code is available here.

The question remains, is there a way to tap into the Intel SHA hashing extensions from the JVM?

UPDATE

As @MJM suggested in the comments, I have removed the String functions and tested purely on byte[] to byte[]. Here are sample results:

Hashing 1000000 iterations of SHA-256
time java: 674          1483679.5252225519 hashes/sec
time apache: 833        1200480.1920768307 hashes/sec
time guava: 705         1418439.7163120567 hashes/sec
time bouncy: 692        1445086.7052023121 hashes/sec
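
For illustration only (this is not the linked test code): a minimal sketch of what a byte[]-to-byte[] measurement can look like when the hash step and the hex-encoding step are timed separately. The class name `ShortHashSketch`, the per-thread `MessageDigest` reuse, and the hand-rolled `toHex` helper are assumptions made for this example, and a plain `System.nanoTime()` loop without warm-up is only a rough indicator compared with a proper harness such as JMH.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of any library mentioned above.
final class ShortHashSketch {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // MessageDigest instances are not thread-safe, so keep one per thread
    // and reuse it instead of calling getInstance() for every value.
    private static final ThreadLocal<MessageDigest> SHA256 = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    });

    // byte[] in, byte[] out: no String handling in the measured hash step.
    static byte[] sha256(byte[] input) {
        // digest(byte[]) processes the input and resets the instance for reuse.
        return SHA256.get().digest(input);
    }

    // Lowercase hex encoding, kept separate so it can be timed on its own.
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xff;
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0f];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        byte[][] inputs = new byte[n][];
        for (int i = 0; i < n; i++) {
            inputs[i] = ("value-" + i).getBytes(StandardCharsets.UTF_8);
        }

        long t0 = System.nanoTime();
        byte[][] digests = new byte[n][];
        for (int i = 0; i < n; i++) {
            digests[i] = sha256(inputs[i]);
        }
        long t1 = System.nanoTime();

        long hexChars = 0; // consumed below so the loop is not optimized away
        for (int i = 0; i < n; i++) {
            hexChars += toHex(digests[i]).length();
        }
        long t2 = System.nanoTime();

        System.out.printf("hash: %d ms, hex: %d ms (%d hex chars)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, hexChars);
    }
}
```

Keeping the two loops apart makes it visible how much of a String-to-String result comes from the hex encoder rather than from SHA-256 itself.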

Updated code

If you want to "hook specific Intel instructions", then you need to write it in a lower-level language, e.g. assembler or maybe C, then [call that from Java](https://stackoverflow.com/q/5963266/5221149). — Andreas, Oct 16 '19 at 01:11
That's what I was afraid of. I didn't know if any JVM implementation gave a hook into this instruction set. — Scott, Oct 16 '19 at 01:16
@Andreas: Or maybe the JVM exposes it itself as a built-in. But if not, then yeah you want JNI (Java Native Interface) to call a C/C++ intrinsics version. That has pretty high per-call overhead so you probably want your native function to take a list / array of strings and produce an output array of hashes, or something like that. But yes, Sun/Oracle JVM and OpenJDK support JNI. — Peter Cordes, Oct 16 '19 at 01:51
Note that x86 SHA extensions are only available on a very limited set of CPUs. ([Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?](//stackoverflow.com/q/20692386)) e.g. AMD Ryzen, Intel CannonLake / IceLake, and Intel low-power Goldmont and later. There are no Xeon chips with SHA extensions yet, but maybe you're targeting AMD server chips or embedded Intel? Or upcoming IceLake chips? Currently IceLake is only out as a laptop chip, but desktop and eventually server chips will be coming. **What CPUs does this need to run on?** — Peter Cordes, Oct 16 '19 at 01:59
@PeterCordes, eventually it would be run on AWS boxes, which I believe are all Xeon -- Intel Xeon E5-2686 v4 (Broadwell) and Intel Xeon® Platinum 8175 processors with new Intel Advanced Vector Extension (AVX-512) instruction set — Scott, Oct 16 '19 at 02:14
Then you don't have SHA extensions for instructions like `SHA256RNDS2`. Skylake and Cascade Lake Xeon don't have it. I forget if AVX512BW helps with SHA256; many short strings probably don't benefit from 64-byte vectors, but you can use AVX512's new shuffle instructions and merge-masking on 256-bit vectors (32 bytes) if that helps. Using 512-bit vectors (64 bytes) would reduce max turbo (hurting performance for the rest of your program for a few milliseconds after the last 512-bit vector instructions). — Peter Cordes, Oct 16 '19 at 02:18
Kind of curious why you want SHA256 values for millions of short (how short?) String and Number objects. — Jim Mischel, Oct 16 '19 at 03:28
When a library using Java’s builtin hashing implementation performs three times faster than Java’s builtin hashing implementation, it’s time to question your testing methodology. — Holger, Oct 25 '19 at 09:30
@Holger, my code is linked in the question and is pretty straightforward. Are you saying Google just wrapped Java's built-in implementation? I'm not sure that's correct. — Scott, Oct 27 '19 at 22:05
Well, Guava’s point is to abstract the hash operation so that you don’t need to worry about whether it has a custom implementation or delegates to Java’s builtin provider. [This version](https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/Hashing.java) will delegate to `MessageDigestHashFunction` in case of SHA, which is a [wrapper around Java’s builtin abstraction](https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/MessageDigestHashFunction.java), the `MessageDigest` class, hence the name. Feel free to check the version you’ve used… — Holger, Oct 28 '19 at 11:35
You seem to be testing 2 things at the same time here: the sha256 hash implementation and the hex encoding implementation. Given that the Apache Commons sha256 implementation just delegates to the java implementation, you would expect that to take the same amount of time as the Java. Since the Apache Commons is going so much faster, it's likely that your hex function is much slower than Apache's hex implementation. I'd suggest testing the hash functions and hex functions separately to get a true comparison. — MJM, Jun 22 '22 at 10:37
@MJM you are right, I was not purely testing hash performance as I was outputting a String. In my use case I need to get a string, but I should test those separately. I have rerun the tests and the results are much more similar. I have updated above. — Scott, Jul 12 '22 at 05:54

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

The fastest solution I found that made it simple to use native cryptographic functionality is Amazon Corretto Crypto Provider (ACCP).

https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/

https://github.com/corretto/amazon-corretto-crypto-provider

From Amazon:

What exactly is ACCP?

ACCP implements the standard Java Cryptography Architecture (JCA) interfaces and replaces the default Java cryptographic implementations with those provided by libcrypto from the OpenSSL project. ACCP allows you to take full advantage of assembly-level and CPU-level performance tuning, to gain significant cost reduction, latency reduction, and higher throughput across multiple services and products, as shown in the examples below.
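
A minimal sketch of wiring this in, under stated assumptions: the ACCP jar is an optional dependency, and the provider class `com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider` with its static `install()` method is taken from the project's README. The class name `AccpSketch` and the reflective lookup are hypothetical choices for this example, made so the code still runs (on the default JDK provider) when ACCP is not on the classpath. With the dependency present, calling `AmazonCorrettoCryptoProvider.install()` directly is the simpler form.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for this example only.
final class AccpSketch {

    // Tries to register ACCP as the preferred JCA provider.
    // Returns false if the jar (or its native library) is unavailable.
    static boolean tryInstallAccp() {
        try {
            Class<?> accp = Class.forName(
                    "com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider");
            accp.getMethod("install").invoke(null);
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            return false; // fall back to the JDK's built-in provider
        }
    }

    // Standard JCA call: the code is identical with or without ACCP,
    // because the provider is chosen by the JCA lookup, not by the caller.
    static byte[] sha256(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        boolean installed = tryInstallAccp();
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        System.out.println("ACCP installed: " + installed);
        System.out.println("SHA-256 provider: " + md.getProvider().getName());
        System.out.println("digest length: " + sha256(new byte[0]).length);
    }
}
```

Printing the provider name is a quick way to check which implementation `MessageDigest.getInstance("SHA-256")` actually resolves to before benchmarking.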

answered Dec 03 '19 at 23:24

Scott

Since your question mentioned Intel SHA extensions: they're supported on AMD Ryzen, Intel CannonLake / IceLake, and Intel low-power Goldmont and later. There are no Xeon chips with SHA extensions until IceLake Xeon becomes available, but support exists in AMD server/desktop/laptop CPUs, and IceLake laptop CPUs are available now. [Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?](//stackoverflow.com/q/20692386) – Peter Cordes Dec 04 '19 at 00:05
Good point on the question calling out specific CPU functions. I guess the question was two parts 1) make hashing faster in Java and 2) can that Java code tap into any acceleration available within the CPU. To your point #2 is a much more nuanced and trickier question to answer. – Scott Dec 04 '19 at 06:15
I'd assume that ACCP will take advantage of dedicated SHA instructions on CPUs that support them. And if not, will use whatever SIMD is available, like x86 AVX2, or AArch64 AdvSIMD. It does explicitly say it can take advantage of "assembly-level" stuff. The only question would be if it can use a crypto accelerator *device* where access to it is more like a GPU. – Peter Cordes Dec 04 '19 at 06:42
