2

I need to hash files that are large enough to cause memory pressure if they are loaded whole, so I'm trying to find out whether we can compute the hash on the fly from a file stream.

While researching possibilities, I wrote a quick test to make sure that MD5's ComputeHash returns the same hash whether it is given the bytes of a string or a stream.

let CreateMD5HashFromString (value: string) =
     Convert.ToBase64String(MD5.Create().ComputeHash(Encoding.ASCII.GetBytes(value)))

let CreateMD5HashFromStream (value: Stream) =
     Convert.ToBase64String(MD5.Create().ComputeHash(value))

I'm testing the calls with the following unit test:

[<TestMethod>]
member this.``CreateMD5Hash is the same between a string and a file stream`` () =
    let sampleText = File.ReadAllText("Sample.txt")
    let textMD5 = Security.CreateMD5HashFromString(sampleText)
    // Dispose the file handle once the hash has been computed
    use stream = File.OpenRead("Sample.txt")
    let streamMD5 = Security.CreateMD5HashFromStream(stream)

    Assert.AreEqual(textMD5, streamMD5)

It's reading a small sample file for the test. This test fails because the generated hashes are different. To me this seems incorrect, but I'm not exactly sure. Does anyone know for sure whether these should be the same?

Also, a secondary question: am I saving myself memory issues by using the stream overload of ComputeHash, or does it load the entire stream before hashing? I tried to disassemble the related .NET assembly, but got lost trying to track down what HashCore does under the hood.
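One way to take the guesswork out of the memory question is to feed the hash in fixed-size chunks yourself. The sketch below uses TransformBlock and TransformFinalBlock, which are members of HashAlgorithm; the helper name hashStreamInChunks and the buffer size are made up for illustration, and it is a minimal sketch rather than a statement about how ComputeHash(Stream) is implemented internally.

```fsharp
open System
open System.IO
open System.Security.Cryptography

// Hypothetical helper: hashes a stream in fixed-size chunks so that only
// one buffer's worth of data is held in memory at a time.
let hashStreamInChunks (stream: Stream) (bufferSize: int) =
    use md5 = MD5.Create()
    let buffer : byte[] = Array.zeroCreate bufferSize
    let mutable read = stream.Read(buffer, 0, buffer.Length)
    while read > 0 do
        // Feed this chunk into the running hash state
        md5.TransformBlock(buffer, 0, read, null, 0) |> ignore
        read <- stream.Read(buffer, 0, buffer.Length)
    // Finish the hash with an empty final block
    md5.TransformFinalBlock(Array.empty<byte>, 0, 0) |> ignore
    Convert.ToBase64String(md5.Hash)
```

Called as hashStreamInChunks stream 4096, this should produce the same Base64 string as passing the same bytes to ComputeHash, since the hash depends only on the byte sequence and not on how it is chunked.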

Joshua Belden
  • It does *seem* wrong...maybe it's reading them with different encodings? Although I'm not sure if that would change the byte values or not... – mpen Jan 08 '13 at 00:38
  • I should have added, I've attempted to flush the stream prior to hashing, but that hasn't changed anything. – Joshua Belden Jan 08 '13 at 00:40
  • I think this is a duplicate of a similar post that looks like it will solve this exact issue: http://stackoverflow.com/questions/2124468/possible-to-calculate-md5-or-other-hash-with-buffered-reads – Joshua Belden Jan 08 '13 at 00:44

2 Answers

4

It's actually pretty simple: you can't assume text is equal to its underlying binary representation.

In this sample which both creates and reads the sample text as ASCII, it works fine just as you'd expect:

public static void Main(string[] args)
{
    System.IO.File.WriteAllBytes("test", System.Text.Encoding.ASCII.GetBytes("test string"));

    var inputString = System.IO.File.ReadAllText("test");
    var inputBytes = System.IO.File.ReadAllBytes("test");
    var inputStream = new System.IO.FileStream("test", System.IO.FileMode.OpenOrCreate);

    var stringHash = Convert.ToBase64String(System.Security.Cryptography.MD5.Create().ComputeHash(System.Text.Encoding.ASCII.GetBytes(inputString)));
    var streamHash = Convert.ToBase64String(System.Security.Cryptography.MD5.Create().ComputeHash(inputStream));
    var bytesHash = Convert.ToBase64String(System.Security.Cryptography.MD5.Create().ComputeHash(inputBytes));

    Console.WriteLine("String hash: {0}", stringHash);
    Console.WriteLine("Stream hash: {0}", streamHash);
    Console.WriteLine("Bytes hash: {0}", bytesHash);

    Console.WriteLine("\nMD5s {0}", stringHash == streamHash && streamHash == bytesHash ? "match" : "don't match");
}

With output

String hash: b421md6Yb6t6IWJbeRZYnA==
Stream hash: b421md6Yb6t6IWJbeRZYnA==
Bytes hash: b421md6Yb6t6IWJbeRZYnA==

MD5s match

However, this only works if the file on disk is plain ASCII. There is no guarantee in any other case. For example, many non-ASCII files start with a BOM (byte order mark) that signals the encoding. The BOM is part of the bytes hashed from the stream or the byte array, but it is not part of the string in memory, so it never reaches the string hash. UTF-8, and Unicode in general, can also have several different representations of the same string, and a string can be normalized when it is loaded into a string object into a representation different from what is on disk.
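The BOM effect is easy to see without involving MD5 at all. The following is a minimal sketch, assuming a scratch file from Path.GetTempFileName() and the same "test string" content as above: it writes the text as UTF-8 with a signature, then compares the raw bytes on disk with the bytes obtained by going through a string.

```fsharp
open System.IO
open System.Text

// Scratch file used only for this illustration
let path = Path.GetTempFileName()

// UTF8Encoding(true) emits the 3-byte UTF-8 signature (EF BB BF) when writing
File.WriteAllText(path, "test string", UTF8Encoding(true))

// What a stream or ReadAllBytes sees: signature plus text
let rawBytes = File.ReadAllBytes(path)
// What the string round trip sees: ReadAllText detects and drops the signature
let textBytes = Encoding.ASCII.GetBytes(File.ReadAllText(path))

printfn "raw: %d bytes, via string: %d bytes" rawBytes.Length textBytes.Length
printfn "equal after skipping the signature: %b" (rawBytes.[3..] = textBytes)

File.Delete(path)
```

Since the two byte arrays differ in length, any hash over them will differ too. Note also that Encoding.ASCII.GetBytes replaces every non-ASCII character with '?', so text outside the ASCII range would diverge even without a signature.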

Mahmoud Al-Qudsi
  • This answered my question most correctly. I'm in charge of writing these files and so I'll be able to match the encoding. Thank you. – Joshua Belden Jan 08 '13 at 19:06
2

I think the key question is, what encoding is used in the source file?

The hashes will be the same if the byte array you get from Encoding.ASCII.GetBytes contains the same bytes as the Stream, but that will only be the case when the file uses the same encoding as the one used with GetBytes and there is no signature (BOM) in the file.

This has nothing to do with the MD5 function, so you can test it more easily by checking the bytes directly (assuming the file is less than 10 kB; otherwise you need a larger buffer):

let path = "test.txt" // both reads must target the same file
let res1 = Encoding.ASCII.GetBytes(File.ReadAllText(path))
let buffer : byte[] = Array.zeroCreate 10240
let size =
    use stream = File.OpenRead(path)
    stream.Read(buffer, 0, buffer.Length)
let res2 = buffer.[0 .. size - 1]

res1 = res2 // Are the byte arrays the same?

When I tried running this, I had to solve two things:

  • The file I used was saved as UTF-8 with a signature, so there were 3 bytes at the beginning specifying the encoding (and I only got the same byte arrays if I used buffer.[3 .. size - 1]).

  • I had to save the file with the same encoding (ASCII in this case, but getting this right might be tricky in general). Alternatively, you can specify the encoding when reading the file, but then you might be hashing nonsense text.
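Instead of hard-coding the offset 3, the signature can be detected with Encoding.UTF8.GetPreamble(). This is only a sketch: stripUtf8Signature is a hypothetical helper name, it handles the UTF-8 signature only, and dropping the signature changes what gets hashed, so both sides of a comparison would have to do the same.

```fsharp
open System.Text

// Hypothetical helper: returns the bytes without a leading UTF-8 signature,
// or the bytes unchanged if no signature is present.
let stripUtf8Signature (bytes: byte[]) =
    let preamble = Encoding.UTF8.GetPreamble() // EF BB BF
    let hasSignature =
        bytes.Length >= preamble.Length
        && Array.forall2 (=) preamble bytes.[0 .. preamble.Length - 1]
    if hasSignature then bytes.[preamble.Length ..] else bytes
```

With this in place, the comparison above could use stripUtf8Signature on the bytes read from the stream rather than a fixed slice.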

Tomas Petricek