7

I have an input stream that I want to use both to compute a hash and to save the file to disk, and I would like to know how to do that efficiently. Should I use tasks to do both concurrently, should I duplicate the stream and pass one copy to the saveFile method and one to the computeHash method, or should I do something else?

Mark Segal
Dave
  • 1
    I asked a similar question recently: http://stackoverflow.com/questions/10985282/generate-running-hash-or-checksum-in-c (the answers are likely applicable here due to the constraints), I assumed "hash" to mean MD5, SHAx, etc. –  Jun 20 '12 at 17:42
  • I have used SHA256Cng and can also save the file. My question is more about whether to do both at the same time (using tasks/futures) or sequentially (reading a FileStream moves its internal pointer, so I could reset the pointer to zero or duplicate the stream). I don't know which is better or how to do it. – Dave Jun 20 '12 at 17:46
  • 4
    *muses about reading the linked question* (Also consider a "stream splitter", which could be used to potentially reduce some manual work of copying between two output streams.) –  Jun 20 '12 at 17:48

5 Answers

3

What about using a hash algorithm that operates at the block level? You can add each block to the hash (using TransformBlock) and then write that same block to the file, for each block in the stream.

Untested rough shot:

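using System.IO;
using System.Security.Cryptography;

...

public byte[] HashedFileWrite(string filename, Stream input)
{
    using (var hashAlgorithm = MD5.Create())
    {
        // File.Create truncates an existing file; File.OpenWrite would leave
        // stale trailing bytes behind if the old file was longer.
        using (var file = File.Create(filename))
        {
            byte[] buffer = new byte[4096];
            int read;

            // Feed each block to the hash, then write the same block to disk.
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                hashAlgorithm.TransformBlock(buffer, 0, read, null, 0);
                file.Write(buffer, 0, read);
            }

            // read is 0 here, so this finalizes the hash without adding data.
            hashAlgorithm.TransformFinalBlock(buffer, 0, 0);
        }

        return hashAlgorithm.Hash;
    }
}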
Matt Murrell
  • I'm not a big fan of the manual block processing, but this ought to work. (I think the CryptoStream is a simpler approach which comes down to being a pretty wrapper.) –  Jun 20 '12 at 19:56
  • Agreed. I generally avoid them like the plague (Thank God for the recent Stream.CopyTo method)... I think this is the best way to solve the problem tho. Also, a second read makes me think I have a bug where the final block is hashed twice... To be an accurate MD5, you would have to detect the EOS and handle the last block differently. – Matt Murrell Jun 20 '12 at 20:55
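
One way to check the double-hashing concern raised in the comment above is to compare the block-wise result with a one-shot ComputeHash over the same bytes. The following is only a minimal sketch: the class and method names (IncrementalHashCheck, HashInBlocks, HashOneShot) are made up for illustration, and the data is random test input.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class IncrementalHashCheck
{
    // Hashes the stream block by block and finishes with an empty final block,
    // mirroring what the answer's loop does once Read returns 0.
    public static byte[] HashInBlocks(Stream input, int blockSize)
    {
        using (var md5 = MD5.Create())
        {
            var buffer = new byte[blockSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            md5.TransformFinalBlock(buffer, 0, 0); // zero bytes: nothing is hashed twice
            return md5.Hash;
        }
    }

    // Reference result: hash the whole array in one call.
    public static byte[] HashOneShot(byte[] data)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(data);
        }
    }

    public static void Main()
    {
        var data = new byte[10000]; // deliberately not a multiple of the block size
        new Random(42).NextBytes(data);

        byte[] blockwise = HashInBlocks(new MemoryStream(data), 4096);
        byte[] oneShot = HashOneShot(data);

        Console.WriteLine(blockwise.SequenceEqual(oneShot));
    }
}
```

If the two results match, the last block is not being hashed twice: the loop only exits when Read returns 0, so the final call contributes no data.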
3

This method will copy and hash with chained streams.

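private static byte[] CopyAndHash(string source, string target)
{
    using (var sha512 = SHA512.Create())
    {
        // File.Create truncates any existing target file before writing.
        using (var targetStream = File.Create(target))
        using (var cryptoStream = new CryptoStream(targetStream, sha512, CryptoStreamMode.Write))
        using (var sourceStream = File.OpenRead(source))
        {
            // Copy into the CryptoStream, not the file stream directly, so the
            // bytes pass through the hash on their way to disk.
            sourceStream.CopyTo(cryptoStream);
        }

        // Disposing the CryptoStream flushed the final block, so Hash is valid here.
        return sha512.Hash;
    }
}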

For a full sample, including cancellation and progress reporting, see https://gist.github.com/dhcgn/da1637277d9456db9523a96a0a34da78
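
Since the question starts from an arbitrary input stream rather than a source file, the same chaining idea might be adapted roughly as follows. This is a sketch under assumptions, not a tested implementation: the names StreamHasher and SaveAndHash are invented here, and SHA256 is an arbitrary choice of algorithm.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class StreamHasher
{
    // Writes everything from 'input' to 'targetPath' and returns the hash
    // of the bytes that were written, in a single pass over the stream.
    public static byte[] SaveAndHash(Stream input, string targetPath)
    {
        using (var sha256 = SHA256.Create())
        {
            using (var targetStream = File.Create(targetPath))
            using (var cryptoStream = new CryptoStream(targetStream, sha256, CryptoStreamMode.Write))
            {
                // The hash transform passes bytes through unchanged to the file.
                input.CopyTo(cryptoStream);
            }

            // The CryptoStream has been disposed, which finalized the hash.
            return sha256.Hash;
        }
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        byte[] payload = Encoding.UTF8.GetBytes("hello, stream");

        byte[] hash = SaveAndHash(new MemoryStream(payload), path);
        Console.WriteLine(BitConverter.ToString(hash).Replace("-", ""));

        File.Delete(path);
    }
}
```

The input stream is read only once and never needs to be seekable, which avoids both the duplicate-stream and the reset-the-pointer options from the question.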

Chris Benard
hdev
1

It might not be the best option, but I would opt for a Stream descendant/wrapper that acts as a pass-through to the stream that actually writes the file to disk.

So:

  • derive from Stream
  • have one member such as Stream _inner; that will be the target stream to write
  • implement Write() and all related stuff
  • in Write() hash the blocks of data and call _inner.Write()
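
The steps above could be sketched roughly like this. Everything here is an assumption made for illustration: the class is write-only, it uses SHA256 (the answer does not name an algorithm), GetComputedHash returns a byte[], and disposing the wrapper also disposes the inner stream.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical write-only pass-through stream: hashes each block, then forwards it.
public sealed class HashWrapStream : Stream
{
    private readonly Stream _inner;                          // target stream to write
    private readonly HashAlgorithm _hash = SHA256.Create();  // algorithm is an assumption
    private byte[] _result;                                  // set once the hash is finalized

    public HashWrapStream(Stream inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        _inner = inner;
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        if (_result != null) throw new InvalidOperationException("Hash already finalized.");
        _hash.TransformBlock(buffer, offset, count, null, 0); // hash the block
        _inner.Write(buffer, offset, count);                  // then pass it through
    }

    public override void Flush() { _inner.Flush(); }

    // Finalizes the hash on first call; call it before disposing the wrapper.
    public byte[] GetComputedHash()
    {
        if (_result == null)
        {
            _hash.TransformFinalBlock(new byte[0], 0, 0);
            _result = _hash.Hash;
        }
        return _result;
    }

    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _hash.Dispose();
            _inner.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

This is essentially what CryptoStream already does when given a HashAlgorithm, so a hand-written wrapper mostly makes sense when extra behavior (progress reporting, several hashes at once) is needed.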

Usage example

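// HashWrapStream is the hypothetical wrapper class described above.
using (Stream s = File.OpenRead("infile.dat"))
using (Stream output = File.Create("outfile.dat")) // "out" is a reserved word in C#
using (var hasher = new HashWrapStream(output))
{
    byte[] buffer = new byte[1024];
    int read;
    while ((read = s.Read(buffer, 0, buffer.Length)) != 0)
    {
        hasher.Write(buffer, 0, read); // write only the bytes actually read
    }
    var hash = hasher.GetComputedHash(); // get actual hash
}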
Daniel Mošmondor
0

Here is my solution. It writes an array of structs (the ticks variable) as a CSV file (using the CsvHelper NuGet package) and then creates a hash for checksum purposes in a companion file with the .sha256 suffix.

I do this by writing the CSV to a MemoryStream, then writing the MemoryStream to disk, then passing the MemoryStream to the hash algorithm.

This solution keeps the entire file in memory as a MemoryStream. It's fine for everything except multi-gigabyte files that would run you out of RAM. If I had to do this again, I'd probably try the CryptoStream approach, but this is good enough for my foreseeable purposes.

I have verified via a 3rd party tool that the hashes are valid.

Here is the code:

//var ticks = **some_array_you_want_to_write_as_csv**

using (var memoryStream = new System.IO.MemoryStream())
using (var textWriter = new System.IO.StreamWriter(memoryStream))
using (var csv = new CsvHelper.CsvWriter(textWriter))
{
    csv.Configuration.DetectColumnCountChanges = true; //error checking
    csv.Configuration.RegisterClassMap<TickDataClassMap>();
    csv.WriteRecords(ticks);

    //push buffered text into the memory stream before reading it back
    textWriter.Flush();

    //write to disk
    using (var fileStream = new System.IO.FileStream(targetFileName, System.IO.FileMode.Create))
    {
        memoryStream.Position = 0;
        memoryStream.CopyTo(fileStream);
    }

    //write sha256 hash of the in-memory copy (the file on disk is not re-read)
    using (var sha256 = System.Security.Cryptography.SHA256.Create())
    {
        memoryStream.Position = 0;
        var hash = sha256.ComputeHash(memoryStream);
        //opened only to confirm the file exists and is readable; its contents are not compared
        using (var reader = System.IO.File.OpenRead(targetFileName))
        {
            //ConvertByteArrayToHexString is not a framework method; presumably a custom extension (not shown)
            System.IO.File.WriteAllText(targetFileName + ".sha256", hash.ConvertByteArrayToHexString());
        }
    }
}
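
For comparison, the CryptoStream variant mentioned above might look roughly like the sketch below, which avoids holding the whole file in memory. It is only an illustration under assumptions: a plain StreamWriter writing lines stands in for the CsvHelper writer, and the names StreamingCsvHash and WriteLinesWithHash are invented.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class StreamingCsvHash
{
    // Writes the lines to targetFileName while hashing the same bytes,
    // then stores the hex digest next to it as targetFileName + ".sha256".
    public static string WriteLinesWithHash(string targetFileName, string[] lines)
    {
        byte[] hash;
        using (var sha256 = SHA256.Create())
        {
            using (var fileStream = new FileStream(targetFileName, FileMode.Create))
            using (var cryptoStream = new CryptoStream(fileStream, sha256, CryptoStreamMode.Write))
            using (var textWriter = new StreamWriter(cryptoStream, new UTF8Encoding(false)))
            {
                foreach (string line in lines)
                {
                    textWriter.WriteLine(line); // a CSV writer would sit here instead
                }
            }

            // All three streams are closed, so the hash has been finalized.
            hash = sha256.Hash;
        }

        string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        File.WriteAllText(targetFileName + ".sha256", hex);
        return hex;
    }
}
```

Memory use stays bounded by the writer and stream buffers regardless of file size, at the cost of not having the data available for a second pass.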
JasonS
-2

You'll need to stuff the stream's bytes into a byte[] in order to hash them.

bluevector
  • 1
    You can pass a stream too. What would be the benefits of converting the stream to a byte[]? – Dave Jun 20 '12 at 18:20
  • I, for some reason, didn't see that overload. Ever. I shall go say 10 "Hail Bills Gates'" in penance. – bluevector Jun 20 '12 at 18:21
  • 1
    @Dave There is no advantage. Both the form that takes a `byte[]` and the one that takes a `Stream` are blocking and expect the entire data in one shot. With threads and a special `Stream`... but that just adds more problems than it solves... –  Jun 20 '12 at 20:04
  • @bluevector I'd suggest just deleting this answer if possible. It's a really bad suggestion that will cause an immense amount of memory usage + `OutOfMemoryException` for large files. – Chris Benard Jan 25 '22 at 22:56
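
To illustrate the overload discussed in these comments: ComputeHash accepts a Stream directly, so the caller does not have to copy the data into a byte[] first. The sketch below uses made-up names (ComputeHashOverloads, HashFromStream, HashFromArray) and an in-memory stream as stand-in input.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class ComputeHashOverloads
{
    // Hashes directly from the stream; the caller never materializes a byte[].
    public static byte[] HashFromStream(Stream input)
    {
        using (var sha256 = SHA256.Create())
        {
            return sha256.ComputeHash(input);
        }
    }

    // Hashes an array that is already fully in memory.
    public static byte[] HashFromArray(byte[] data)
    {
        using (var sha256 = SHA256.Create())
        {
            return sha256.ComputeHash(data);
        }
    }

    public static void Main()
    {
        var data = new byte[100000];
        new Random(3).NextBytes(data);

        byte[] fromStream = HashFromStream(new MemoryStream(data));
        byte[] fromArray = HashFromArray(data);

        Console.WriteLine(fromStream.SequenceEqual(fromArray));
    }
}
```

Note that ComputeHash(Stream) consumes the stream to its end, so saving the same data afterwards still requires a seekable stream or one of the single-pass approaches from the other answers.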