
I need to calculate the MD5 checksum for many large files. The code for this is pretty simple:

System.IO.FileStream fsFile = new System.IO.FileStream(strFullPath, FileMode.Open);
fsFile.Seek(1000, SeekOrigin.Begin);    //skip some chars if need be
System.Security.Cryptography.MD5 md5 = new System.Security.Cryptography.MD5CryptoServiceProvider();
byte[] arrBtMd5 = md5.ComputeHash(fsFile);

The problem starts, if I want to do one of the following:

  • Calculate several hash functions for the same file (MD5, SHA1, CRC32, and what-not).
  • Calculate the MD5 for the entire file and another MD5 for the same file with some header rows skipped.

If I do this one by one, the same file will be read multiple times. Disk I/O is the bottleneck of the system, so my questions are:

  1. Can the .NET compiler/framework recognize that I read the same file multiple times and optimize the operation? (I'm pretty sure it does something, because when I added the second MD5 calculation without headers, the impact was not that great.)
  2. What technique can I use to share the same FileStream between multiple "consumers"? I'd like to skim the file only once with a FileStream and split the data for the hashing functions working in parallel.
AdamL
    [Yet another solution](https://codereview.stackexchange.com/questions/244314/reading-one-source-stream-by-multiple-consumers-asynchronously) – aepot Jun 22 '20 at 12:28

2 Answers


I agree with Henk Holterman's response: you'll have to do the split yourself. What you can do, however, is not compute the complete hash with a single ComputeHash call, but feed it in chunks of bytes with TransformBlock calls. See here for an example.

This way you instantiate a buffer of whatever size you like yourself and pass the same buffer to the subsequent TransformBlock calls of each hasher.

Edit: here is some code that does the job:

    static void Hash2Md5InParallel()
    {
        string strFullPath = YourFilePathGoesHere;
        byte[] buffer = new byte[1000];   //read buffer shared by both hashers

        System.Security.Cryptography.MD5 md5_1 = new System.Security.Cryptography.MD5CryptoServiceProvider();
        System.Security.Cryptography.MD5 md5_2 = new System.Security.Cryptography.MD5CryptoServiceProvider();

        using (System.IO.FileStream file = new System.IO.FileStream(strFullPath, FileMode.Open))
        {
            file.Seek(1000, SeekOrigin.Begin);    //skip some chars if need be

            int bytesToHash;
            do
            {
                bytesToHash = file.Read(buffer, 0, buffer.Length);

                //null output buffer: TransformBlock then skips the copy we don't need
                md5_1.TransformBlock(buffer, 0, bytesToHash, null, 0);

                //enter some code to skip some bytes for the other hash if you like...
                md5_2.TransformBlock(buffer, 0, bytesToHash, null, 0);
            }
            while (bytesToHash > 0); //repeat until no more bytes
        }

        //call TransformFinalBlock to finish hashing - an empty block is enough
        md5_1.TransformFinalBlock(new byte[0], 0, 0);
        md5_2.TransformFinalBlock(new byte[0], 0, 0);

        //get both hashes
        byte[] hash1 = md5_1.Hash;
        byte[] hash2 = md5_2.Hash;
    }
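
The same pattern extends to mixing algorithms, which covers the MD5/SHA1 case from the question: every HashAlgorithm exposes TransformBlock and TransformFinalBlock, so one read loop can feed several different hashers. A minimal sketch, with the path parameter as a placeholder:

    static void Md5AndSha1InOnePass(string path)
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        using (var sha1 = System.Security.Cryptography.SHA1.Create())
        using (var file = System.IO.File.OpenRead(path))
        {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                //feed the same chunk to both hashers, one disk read
                md5.TransformBlock(buffer, 0, read, null, 0);
                sha1.TransformBlock(buffer, 0, read, null, 0);
            }

            //finalize both hashes with an empty block
            md5.TransformFinalBlock(new byte[0], 0, 0);
            sha1.TransformFinalBlock(new byte[0], 0, 0);

            byte[] md5Hash = md5.Hash;
            byte[] sha1Hash = sha1.Hash;
        }
    }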
Marwie

1. Can the .NET compiler/framework recognize that I read the same file multiple times and optimize the operation? (I'm pretty sure it does something, because when I added the second MD5 calculation without headers, the impact was not that great.)

No, but the underlying OS (Windows) will cache and buffer your file.

2. What technique can I use to share the same FileStream between multiple "consumers"? I'd like to skim the file only once with a FileStream and split the data for the hashing functions working in parallel.

AFAIK there are no 'stream splitters' available, but you can read the file into a MemoryStream and reuse that. That would only work for fairly small files, though.
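
For illustration, a minimal sketch of that approach, assuming the whole file fits in memory and reusing strFullPath from the question:

    var ms = new System.IO.MemoryStream(System.IO.File.ReadAllBytes(strFullPath));

    byte[] hashFull, hashSkipped;
    using (var md5 = new System.Security.Cryptography.MD5CryptoServiceProvider())
    {
        ms.Position = 0;                   //whole file
        hashFull = md5.ComputeHash(ms);    //ComputeHash resets the hasher afterwards

        ms.Position = 1000;                //same data, header skipped
        hashSkipped = md5.ComputeHash(ms);
    }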

I would leave it to Windows and do nothing special.

You might experiment with running the hashers in parallel; this is a rare situation in which parallel I/O on one disk might work.
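
A sketch of that experiment, again assuming strFullPath from the question; each hasher gets its own stream, opened with FileShare.Read so both reads can proceed at once:

    Func<long, byte[]> hashFrom = offset =>
    {
        using (var md5 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        using (var fs = new System.IO.FileStream(strFullPath, System.IO.FileMode.Open,
                                                 System.IO.FileAccess.Read, System.IO.FileShare.Read))
        {
            fs.Seek(offset, System.IO.SeekOrigin.Begin);  //skip header if asked to
            return md5.ComputeHash(fs);
        }
    };

    var full    = System.Threading.Tasks.Task.Run(() => hashFrom(0));     //whole file
    var skipped = System.Threading.Tasks.Task.Run(() => hashFrom(1000));  //header skipped
    System.Threading.Tasks.Task.WaitAll(full, skipped);
    byte[] hashFull = full.Result, hashSkipped = skipped.Result;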

H H