I'm using Intel's SHA hardware acceleration instructions (`sha256rnds2`, etc.; implementation here), and my throughput is around 30% lower than with OpenSSL's software SHA256.

I'm hashing a single 64-byte SHA256 block, twice. As a comparison, I get around 100 M/s without SHA256 at all, 50 M/s with OpenSSL's SHA256 (two blocks of 64 bytes each) and 35 M/s using Intel's SHA instructions.

With 60 GHz of aggregate clock (24 * 2.5 GHz [ * 2 HT]), that's around 600 cycles going to the two software SHA256 blocks, while the same work using the accelerated instructions takes around 1100 cycles.

Is this expected?

zx485
Dae
  • How's the single-threaded performance? It seems unlikely that multiple cores compete with each other for a shared crypto unit, though. ([Agner Fog's measurements for Ryzen's `sha256rnds2` instruction are 1 uop with 4-cycle latency and a throughput of one per 2 cycles](http://agner.org/optimize/). The surrounding code you linked has a lot of shuffles, too.) – Peter Cordes Dec 23 '17 at 02:39
  • Did you compile with optimization enabled? What compiler / options did you use exactly? Does your benchmark fit purely in L1D / L1I cache of each core you're running it on? – Peter Cordes Dec 23 '17 at 02:40
  • Enabling optimisations revealed a compiler warning about a buffer I was using incorrectly, which was causing the speed penalty. I now get 85 M/s using the hardware instructions, which is not perfect but still pretty good. I believe I might still have some margin to optimise it further. Thank you, Peter! (I'm sorry, but it looks like I can't upvote your comment.) – Dae Dec 23 '17 at 14:01
  • Benchmarking with optimization disabled is insane anyway. It hurts SIMD intrinsics just as much, if not more, than scalar code. https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment/32001196#32001196 – Peter Cordes Dec 23 '17 at 15:36
