I'm trying to design a simple application to be used for calculating a file's CRC32/md5/sha1/sha256/sha384/sha512, and I've run into a bit of a roadblock. This is being done in C#.

I would like to do this as efficiently as possible, so my original thought was to read the file into a MemoryStream before processing, but I soon found out that very large files make me run out of memory very quickly. So it seems that I have to use a FileStream instead. The problem, as I see it, is that only one hash function can be run at a time, and running each one over a FileStream in turn will take a while.

How might I go about reading a small bit of a file into memory, processing it with all 6 algorithms, and then going on to the next chunk? Or does hashing not work that way?

This was my original attempt at reading a file into memory. It failed when I tried to read a CD image into memory prior to running the hashing algorithms on the MemoryStream:

    private void ReadToEndOfFile(string filename)
    {
        if (File.Exists(filename))
        {
            FileInfo fi = new FileInfo(filename);
            FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
            byte[] buffer = new byte[16 * 1024];

            //double step = Math.Floor((double)fi.Length / (double)100);

            this.toolStripStatusLabel1.Text = "Reading File...";
            this.toolStripProgressBar1.Maximum = (int)(fs.Length / buffer.Length);
            this.toolStripProgressBar1.Value = 0;

            using (MemoryStream ms = new MemoryStream())
            {
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    ms.Write(buffer, 0, read);
                    this.toolStripProgressBar1.Value += 1;
                }

                _ms = ms;
            }
        }
    }
Mirrana

3 Answers

Hash algorithms are designed in a way that you can calculate the hash value incrementally. You can find a C#/.NET example for that here. You can easily modify the provided code to update multiple hash algorithm instances in each step.
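
As a rough illustration of the incremental property, the sketch below feeds the same bytes to MD5 in 7-byte pieces and compares the result with a one-shot ComputeHash call. The helper name HashInChunks, the chunk size, and the sample string are assumptions made up for this example; only the HashAlgorithm members come from the framework.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class IncrementalDemo
{
    // Hypothetical helper: feeds `data` to `algorithm` in pieces of `chunkSize` bytes.
    public static byte[] HashInChunks(HashAlgorithm algorithm, byte[] data, int chunkSize)
    {
        algorithm.Initialize(); // start from a clean state
        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            // A null output buffer skips the optional copy of the input.
            algorithm.TransformBlock(data, offset, count, null, 0);
        }
        // An empty final block only finishes the computation.
        algorithm.TransformFinalBlock(data, 0, 0);
        return algorithm.Hash;
    }

    static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("The quick brown fox jumps over the lazy dog");
        using (MD5 md5 = MD5.Create())
        {
            string oneShot = Hex(md5.ComputeHash(data));
            string chunked = Hex(HashInChunks(md5, data, 7));
            Console.WriteLine("one-shot: " + oneShot);
            Console.WriteLine("chunked:  " + chunked);
            Console.WriteLine("equal:    " + (oneShot == chunked));
        }
    }
}
```

The same loop should work for any class derived from HashAlgorithm, which is what makes it possible to update several of them per chunk.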

dtb
You're most of the way there; you just don't need to read the whole thing into memory at once.

All of the hash classes in .NET derive from the HashAlgorithm class, which has two methods for this: TransformBlock and TransformFinalBlock. You should be able to read a chunk from your file, pass it to the TransformBlock method of whichever hashes you want to use, and then move on to the next chunk. Just remember to call TransformFinalBlock for your last chunk from the file; that call finishes the computation, after which the Hash property holds the byte array containing the hash.
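
A minimal sketch of that loop for several hashes in one pass over the file might look like the following. The class name MultiHash, the method HashFile, and the dictionary keys are hypothetical names chosen for this example. CRC32 is left out because System.Security.Cryptography has no CRC32 class, so it would need a separate implementation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class MultiHash
{
    // Hypothetical helper: reads the file once and feeds every chunk to each algorithm.
    public static Dictionary<string, byte[]> HashFile(string path, IDictionary<string, HashAlgorithm> algorithms)
    {
        byte[] buffer = new byte[16 * 1024];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                foreach (HashAlgorithm algorithm in algorithms.Values)
                {
                    // Use `read`, not buffer.Length: the last chunk is usually short.
                    algorithm.TransformBlock(buffer, 0, read, null, 0);
                }
            }
        }

        var results = new Dictionary<string, byte[]>();
        foreach (KeyValuePair<string, HashAlgorithm> pair in algorithms)
        {
            // An empty final block finishes the hash; the result is then in Hash.
            pair.Value.TransformFinalBlock(buffer, 0, 0);
            results[pair.Key] = pair.Value.Hash;
        }
        return results;
    }

    public static void Main(string[] args)
    {
        var algorithms = new Dictionary<string, HashAlgorithm>
        {
            { "MD5", MD5.Create() },
            { "SHA1", SHA1.Create() },
            { "SHA256", SHA256.Create() },
            { "SHA384", SHA384.Create() },
            { "SHA512", SHA512.Create() }
        };

        foreach (KeyValuePair<string, byte[]> pair in HashFile(args[0], algorithms))
        {
            Console.WriteLine("{0}: {1}", pair.Key, BitConverter.ToString(pair.Value).Replace("-", ""));
        }

        foreach (HashAlgorithm algorithm in algorithms.Values)
        {
            algorithm.Dispose();
        }
    }
}
```

Finishing every hash with an empty final block keeps the read loop uniform, so there is no need to detect the last chunk or allocate a smaller array for it.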

For now, I would just do each hash one at a time until it's working, then worry about running the hashes concurrently (using something like the Task Parallel Library).
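
Once the sequential version works, one possible way to bring in the Task Parallel Library is to update all hashes for a chunk in parallel and wait before reading the next chunk. This is only a sketch under assumptions: ParallelMultiHash and HashFile are made-up names, the 1 MB buffer is an arbitrary choice, and whether it is actually faster depends on how the disk speed compares with the hashing speed.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class ParallelMultiHash
{
    // Hypothetical helper: one read loop, with the per-chunk hash updates run in parallel.
    public static byte[][] HashFile(string path, HashAlgorithm[] algorithms)
    {
        byte[] buffer = new byte[1024 * 1024];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                int count = read;
                // Each algorithm instance is used by only one task at a time, and the
                // tasks only read from the shared buffer. Parallel.ForEach returns
                // after all updates finish, so the buffer can safely be reused.
                Parallel.ForEach(algorithms, algorithm =>
                {
                    algorithm.TransformBlock(buffer, 0, count, null, 0);
                });
            }
        }

        var hashes = new byte[algorithms.Length][];
        for (int i = 0; i < algorithms.Length; i++)
        {
            algorithms[i].TransformFinalBlock(buffer, 0, 0);
            hashes[i] = algorithms[i].Hash;
        }
        return hashes;
    }

    public static void Main(string[] args)
    {
        HashAlgorithm[] algorithms = { MD5.Create(), SHA1.Create(), SHA256.Create() };
        foreach (byte[] hash in HashFile(args[0], algorithms))
        {
            Console.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
        }
    }
}
```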

Matt Sieker
  • I've tried getting this to work using MD5, and the program runs, though it appears to be generating incorrect hashes. Here's a link to my code: [link](http://pastebin.com/i3iPwYZv) – Mirrana Apr 26 '12 at 23:52
  • You should be using `read` instead of `buffer.Length` when calling `TransformFinalBlock` – Matt Sieker Apr 27 '12 at 05:41
  • Thanks a lot! I was agonizing over this for a while last night. Ended up hacking together something stupid looking to get it to work, but I couldn't help but feel that it was unnecessary. I found out that it was because the last array was being fully read even when the last chunk was too small for it. I ended up making it create a new byte array for the last piece to equal the size of the last chunk. – Mirrana Apr 27 '12 at 11:50

This might be a great opportunity to get your feet wet with the TPL Dataflow blocks. Read the file in one thread and post the data to a BroadcastBlock<T>. The BroadcastBlock<T> will be linked to 6 different ActionBlock<T> instances. Each ActionBlock<T> will correspond to one of your 6 hash strategies.

var broadcast = new BroadcastBlock<byte[]>(x => x);

// Create each hash once so every chunk feeds the same running state.
var sha1 = SHA1.Create();
var md5 = MD5.Create();

var strategy1 = new ActionBlock<byte[]>(input => DoHash(input, sha1));
var strategy2 = new ActionBlock<byte[]>(input => DoHash(input, md5));
// Create the other 4 strategies.

broadcast.LinkTo(strategy1);
broadcast.LinkTo(strategy2);
// Link the other 4.

using (var fs = File.Open(@"yourfile.txt", FileMode.Open, FileAccess.Read))
using (var br = new BinaryReader(fs))
{
  // ReadBytes returns an empty array at end of file.
  // PeekChar decodes characters and can throw on arbitrary binary data.
  byte[] chunk;
  while ((chunk = br.ReadBytes(1024 * 16)).Length > 0)
  {
    broadcast.Post(chunk);
  }
}

The BroadcastBlock<T> will forward each chunk of data to all linked ActionBlock<T> instances.

Since your question focused more on how to get this all to occur concurrently, I will leave the implementation of DoHash up to you.

private void DoHash(byte[] input, HashAlgorithm algorithm)
{
  // You will need to implement this.
}
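
One way DoHash might be filled in, together with the completion step needed to read the final hashes, is sketched below. It is not the answer's exact pipeline: it sends each chunk to every ActionBlock<T> directly rather than through a BroadcastBlock<T>, and it gives each block a bounded queue so the reader cannot run far ahead of the hashing. DataflowHash, HashFile, and the capacity of 8 are assumptions for this example, and depending on the target framework the System.Threading.Tasks.Dataflow assembly may need to be referenced as a separate package.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class DataflowHash
{
    // Feeds one chunk into the running hash state.
    static void DoHash(byte[] input, HashAlgorithm algorithm)
    {
        algorithm.TransformBlock(input, 0, input.Length, null, 0);
    }

    // Hypothetical helper: one ActionBlock per algorithm, each processing chunks in order.
    public static byte[][] HashFile(string path, HashAlgorithm[] algorithms)
    {
        var options = new ExecutionDataflowBlockOptions { BoundedCapacity = 8 };
        var blocks = new ActionBlock<byte[]>[algorithms.Length];
        for (int i = 0; i < algorithms.Length; i++)
        {
            HashAlgorithm algorithm = algorithms[i];
            blocks[i] = new ActionBlock<byte[]>(chunk => DoHash(chunk, algorithm), options);
        }

        byte[] buffer = new byte[16 * 1024];
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copy the chunk: the blocks consume it later, after buffer is reused.
                byte[] chunk = new byte[read];
                Buffer.BlockCopy(buffer, 0, chunk, 0, read);
                foreach (ActionBlock<byte[]> block in blocks)
                {
                    // SendAsync waits while the block's queue is full instead of dropping data.
                    block.SendAsync(chunk).Wait();
                }
            }
        }

        // Signal end of input and wait until every queued chunk has been hashed.
        foreach (ActionBlock<byte[]> block in blocks)
        {
            block.Complete();
        }
        Task.WaitAll(Array.ConvertAll(blocks, b => b.Completion));

        var hashes = new byte[algorithms.Length][];
        for (int i = 0; i < algorithms.Length; i++)
        {
            algorithms[i].TransformFinalBlock(new byte[0], 0, 0);
            hashes[i] = algorithms[i].Hash;
        }
        return hashes;
    }
}
```

Each ActionBlock<T> processes one message at a time by default, so a single HashAlgorithm instance per block sees the chunks sequentially and in the order they were sent.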
Brian Gideon
  • This looks like a very interesting approach to multithreading. Shame it's in .net 4.5. For whatever reason, I have a hard enough time convincing myself to use .net 4.0, as it doesn't feel mainstream enough to me yet. – Mirrana Apr 27 '12 at 16:59
  • It seems this approach will not work. DoHash will be called for each input array of bytes. How should they be combined? – Peter Mar 18 '15 at 15:35
  • BroadcastBlock drops messages if its buffer is full. That causes a wrong hash calculation if the file read speed is higher than the hashing speed. – Fedor Cherepanov Mar 21 '19 at 09:06