Using Intel's SHA256 instructions on AMD Epyc seems to be slower than not using hardware acceleration at all

Question

I'm using Intel's SHA hardware acceleration instructions (`sha256rnds2`, etc, implementation here), and I'm seeing throughput around 30% lower than with OpenSSL's software SHA256.

I'm doing a single SHA256 round (64 bytes), twice. For comparison, I get around 100 M/s without SHA256 at all, 50 M/s with OpenSSL's SHA256 (two rounds of 64 bytes each) and 35 M/s using Intel's SHA instructions.

With 60 GHz (24 * 2.5 GHz [ * 2 HT]), that's around 600 cycles going to the two soft SHA256 rounds, while the same using accelerated instructions takes around 1100 cycles.
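The cycle estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names `cycles_per_item` and `sha_cycles` are made up for illustration, and it assumes the 60 GHz aggregate clock is fully available to the workload, which ignores turbo, SMT sharing and memory stalls.

```c
#include <assert.h>

/* Cycles spent per item, given the aggregate clock (Hz summed over all
 * cores) and the measured throughput in items per second. */
double cycles_per_item(double aggregate_hz, double items_per_sec)
{
    assert(items_per_sec > 0.0);
    return aggregate_hz / items_per_sec;
}

/* Cycles attributable to hashing: the per-item cost with hashing enabled
 * minus the per-item cost of the baseline run without SHA256. */
double sha_cycles(double aggregate_hz, double baseline_rate, double hashed_rate)
{
    return cycles_per_item(aggregate_hz, hashed_rate)
         - cycles_per_item(aggregate_hz, baseline_rate);
}
```

With the numbers from the question, `sha_cycles(60e9, 100e6, 50e6)` gives 600 cycles for OpenSSL and `sha_cycles(60e9, 100e6, 35e6)` gives roughly 1114 cycles for the SHA-instruction version, which matches the "around 1100" figure.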

Is this expected?

How's the single-threaded performance? It seems unlikely that multiple cores compete with each other for a shared crypto unit, though. ([Agner Fog's measurements for Ryzen's `sha256rnds2` instruction are 1 uop with 4 cycle latency and a throughput of one per 2 cycles](http://agner.org/optimize/). The surrounding code you linked has a lot of shuffles, too.) — Peter Cordes, Dec 23 '17 at 02:39
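The instruction timings quoted in the comment above allow a rough lower bound on the cost of one 64-byte block. The sketch below assumes that each `sha256rnds2` performs two of SHA-256's 64 rounds and that the round instructions form one serial dependency chain. It ignores the message-schedule instructions, shuffles and loads, so real code will be slower. The function names are hypothetical.

```c
#include <assert.h>

enum {
    SHA256_ROUNDS_PER_BLOCK = 64, /* rounds per 64-byte block */
    ROUNDS_PER_RNDS2        = 2   /* rounds done by one sha256rnds2 */
};

/* Number of sha256rnds2 instructions needed for one block. */
int rnds2_per_block(void)
{
    return SHA256_ROUNDS_PER_BLOCK / ROUNDS_PER_RNDS2;
}

/* Lower bound when the hash is latency-bound: every sha256rnds2 waits
 * for the previous one (assumption: a single serial chain). */
int latency_bound_cycles(int blocks, int latency_cycles)
{
    assert(blocks >= 0 && latency_cycles > 0);
    return blocks * rnds2_per_block() * latency_cycles;
}

/* Lower bound when independent hashes are interleaved so that only the
 * issue rate (one instruction per `cycles_per_insn` cycles) limits speed. */
int throughput_bound_cycles(int blocks, int cycles_per_insn)
{
    assert(blocks >= 0 && cycles_per_insn > 0);
    return blocks * rnds2_per_block() * cycles_per_insn;
}
```

For the two blocks in the question, `latency_bound_cycles(2, 4)` gives 256 cycles and `throughput_bound_cycles(2, 2)` gives 128 cycles. Both are well below the roughly 1100 cycles measured, which suggests the slowdown came from somewhere other than the SHA unit itself.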
Did you compile with optimization enabled? What compiler / options did you use exactly? Does your benchmark fit purely in L1D / L1I cache of each core you're running it on? — Peter Cordes, Dec 23 '17 at 02:40
Enabling optimisations revealed a compiler warning about a buffer I was using incorrectly, which was causing the speed penalty. I now get 85 M/s using the hardware instructions. That's not perfect, but still pretty good, and I believe I still have some margin to optimise it further. Thank you Peter! (I'm sorry, but it looks like I can't upvote your comment.) — Dae, Dec 23 '17 at 14:01
Benchmarking with optimization disabled is insane anyway. It hurts SIMD intrinsics just as much, if not more, than scalar code. https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment/32001196#32001196 — Peter Cordes, Dec 23 '17 at 15:36
