2

I'm trying to understand why hashing of multiple hashes is much slower than one hash. In the following test I hash a file twice - first with SHA1 and then with both SHA1 and SHA256. The first execution shows the expected results - disk read dominates the time used - both took around 30 seconds (the latter around a second less despite more work).

However, on subsequent executions I get a strange result: around 10 seconds for the first and 30 for the second. The 10 implying that the original disk read took 20 seconds, and the 30 implying that it took almost no time. Which, probably, really means that for some reason, hashing once is much quicker than twice. But why?

What's happening here?

The code:

Text = TestSpeed(new HashAlgorithm[] { new SHA1Managed() }, path);
Text += " " + TestSpeed(new HashAlgorithm[] { new SHA1Managed(), new SHA256Managed() }, path);

And:

public string TestSpeed(HashAlgorithm[] algorithms, string path)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    byte[] block = new byte[65536];
    int bytesRead = 0;
    using (FileStream stream = new FileStream(path, FileMode.Open))
        while ((bytesRead = stream.Read(block, 0, block.Length)) > 0)
            foreach (HashAlgorithm algorithm in algorithms)
                algorithm.TransformBlock(block, 0, bytesRead, null, 0);
    foreach (HashAlgorithm algorithm in algorithms)
        algorithm.TransformFinalBlock(block, 0, 0);

    sw.Stop();
    return sw.Elapsed.ToString();
}
ispiro
  • 26,556
  • 38
  • 136
  • 291
  • You're reinventing `HashAlgorithm.ComputeHash`. http://referencesource.microsoft.com/#mscorlib/system/security/cryptography/hashalgorithm.cs,57 – SLaks Jun 15 '15 at 18:49
  • Sounds like caching, not hashing... first iteration has disk i/o that increases run time for test 1, for test 2 disk reads are cached. On subsequent iterations all reads are cached, returning the expected result. Really, you should profile this code if it's important - that would clearly reveal where your execution time is being spent and how it changes on subsequent iterations. – J... Jun 15 '15 at 18:49
  • Way too many things to count. Processor cache misses, other tasks on your machine interfering, disk file fragmentation, etc. Good profiling is *hard* – Jay Jun 15 '15 at 18:50
  • @SLaks This code was made from my original code which needs this step-by-step hashing. – ispiro Jun 15 '15 at 18:51
  • 1
    @Jay If the difference was small, or if subsequent executions weren't consistent - I'd agree. – ispiro Jun 15 '15 at 18:52
  • But it does seem consistent. Iteration 1, Test 1 : ~30 seconds (=20 sec disk read, 10s work). Iteration 1, Test 2 : 30 seconds (= 0s disk read, 30s work). Iteration 2, Test 1 : 10 seconds (= 0s disk read, 10s work). Iteration 2, Test 2 : 30s (=0s disk read, 10s work)... etc. This assumes Test1 and Test2 both operate on the same file data and that data would end up in the disk cache. – J... Jun 15 '15 at 18:55
  • How many iterations of the algorithm are being done for the SHA256 hash? Perhaps you're doing more compute than you thought. http://security.stackexchange.com/questions/3959/recommended-of-iterations-when-using-pkbdf2-sha256 – Jay Jun 15 '15 at 18:55
  • @J... You're absolutely right! I have no idea why I didn't notice that! You can transform your comment into an answer! Thanks! – ispiro Jun 15 '15 at 18:57
  • To eliminate some artifacts, you could read the file content into a memory block. Possibly, SHA256 is more complex (to be expected, as SHA1 is now deprecated), and therefore takes more CPU time to calculate. Try to use the same algorithm twice, i.e. SHA1 + SHA 1 instead of SHA1 + SHA256. – Remigius Stalder Jun 15 '15 at 18:57
  • @Jay Just one. See the code. But J... has already answered the question in the comment above. Thanks for your time. – ispiro Jun 15 '15 at 18:58

1 Answers1

3

Your results are most likely caused by disk caching. Assuming both tests operate on the same data, only the first read will result in significant I/O time. IE:

  • Iteration 1, Test 1 : 30 seconds (= 20s disk read, 10s work).
  • Iteration 1, Test 2 : 30 seconds (= 0s disk read, 30s work).
  • Iteration 2, Test 1 : 10 seconds (= 0s disk read, 10s work).
  • Iteration 2, Test 2 : 30 seconds (= 0s disk read, 30s work).
  • ...etc.

This means that hashing twice takes about three times longer than hashing once. Since SHA256 is typically about half as performant as SHA1, this seems a sensible result.

You can decouple this effect by first reading in the file to a memory stream before beginning the tests and using the memory stream for all reads during the test. Profiling would have shown you where your execution time was being spent.

J...
  • 30,968
  • 6
  • 66
  • 143
  • Thank! I have no idea how I missed that. – ispiro Jun 15 '15 at 19:05
  • By the way, this would imply that SHA256-ing a file takes as long as reading it from a drive (though a slow USB-connected one). To me - that's quite surprising. – ispiro Jun 15 '15 at 19:14
  • @ispiro That depends on the drive, the quality of the USB connection, the performance of the algorithm implementation, and the speed of your computer. A reasonably quick USB3.0 drive can deliver high sustained reads (~80-100MB/s) for a contiguous file. A mid-grade computer can SHA256 at about the same rate (again, ~100MB/s or so). Without profiling it's really hard to say... – J... Jun 15 '15 at 19:29