17

Say I have a file A.doc.
Then I copy it to b.doc and move it to another directory.
For me, it is still the same file.
But how can I determine that it is?
When I download files I sometimes read about getting the MD5 something or the checksum, but I don't know what that is about.

Is there a way to check whether these files are binary equal?

John Saunders
Michel

3 Answers

16

If you want to be 100% sure of the exact bytes in the file being the same, then opening two streams and comparing each byte of the files is the only way.

If you just want to be pretty sure (99.9999%?), I would calculate an MD5 hash of each file and compare the hashes instead. Check out System.Security.Cryptography.MD5CryptoServiceProvider.

In my testing, if the files are usually equivalent then comparing MD5 hashes is about three times faster than comparing each byte of the file.
If the files are usually different then comparing byte-by-byte will be much faster, because you don't have to read in the whole file, you can stop as soon as a single byte differs.

Edit: I originally based this answer on a quick test that read from each file byte-by-byte and compared them byte-by-byte. I falsely assumed that the buffered nature of System.IO.FileStream would save me from worrying about hard disk block sizes and read speeds; this was not true. I retested with a version of my program that reads from each file in 4096-byte chunks and then compares the chunks. This method is slightly faster overall than MD5 even when the files are exactly the same, and will of course be much faster if they differ.
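A minimal sketch of the chunked comparison described above might look like the following. It is not the program that was actually benchmarked; the class name `FileCompare`, the method names, and the helper `FillBuffer` are made up for illustration. The helper exists because `Stream.Read` is allowed to return fewer bytes than requested, so each chunk is filled completely before the two buffers are compared.

```csharp
using System;
using System.IO;

public static class FileCompare
{
    // Reads until the buffer is full or the stream ends.
    // Returns the number of bytes actually placed in the buffer.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }

    // Compares two files in 4096-byte chunks and stops at the first difference.
    public static bool ContentsEqual(string path1, string path2)
    {
        const int ChunkSize = 4096;
        var buffer1 = new byte[ChunkSize];
        var buffer2 = new byte[ChunkSize];

        using (var stream1 = new FileStream(path1, FileMode.Open, FileAccess.Read))
        using (var stream2 = new FileStream(path2, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                int count1 = FillBuffer(stream1, buffer1);
                int count2 = FillBuffer(stream2, buffer2);

                if (count1 != count2)
                {
                    return false; // one file ended before the other
                }
                if (count1 == 0)
                {
                    return true; // both files ended without a difference
                }

                for (int i = 0; i < count1; i++)
                {
                    if (buffer1[i] != buffer2[i])
                    {
                        return false;
                    }
                }
            }
        }
    }
}
```

Calling `FileCompare.ContentsEqual(filepath1, filepath2)` returns true only when both files have the same length and the same bytes.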

I'm leaving this answer as a mild warning about the FileStream class, and because I still think it has some value as an answer to "how do I calculate the MD5 of a file in .NET". Apart from that though, it's not the best way to fulfill the original request.

Example of calculating the MD5 hashes of two files (now tested!):

// filepath1 and filepath2 are the paths of the two files to compare.
using (var reader1 = new System.IO.FileStream(filepath1, System.IO.FileMode.Open, System.IO.FileAccess.Read))
{
    using (var reader2 = new System.IO.FileStream(filepath2, System.IO.FileMode.Open, System.IO.FileAccess.Read))
    {
        byte[] hash1;
        byte[] hash2;

        // ComputeHash reads the whole stream; the resulting digest is then available in Hash.
        using (var md51 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md51.ComputeHash(reader1);
            hash1 = md51.Hash;
        }

        using (var md52 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md52.ComputeHash(reader2);
            hash2 = md52.Hash;
        }

        // Compare the two digests byte by byte, stopping at the first mismatch.
        int j = 0;
        for (j = 0; j < hash1.Length; j++)
        {
            if (hash1[j] != hash2[j])
            {
                break;
            }
        }

        // j only reaches hash1.Length if every byte of the digests matched.
        if (j == hash1.Length)
        {
            Console.WriteLine("The files were equal.");
        }
        else
        {
            Console.WriteLine("The files were not equal.");
        }
    }
}
Coxy
  • 4
    Although this code is "shorter" it's certainly significantly slower and generally worse than comparing byte-by-byte. Not to mention MD5 is a dead hash (cryptographically speaking). I'd really not do this, if you just want to check that the files are equal. – Noon Silk Mar 02 '10 at 10:04
  • 1
    @silky: I suspect this would depend on the size of the files and whether the use case has a higher chance of them being equivalent or not, but computing the MD5s of two files is three times faster than comparing their contents byte-by-byte. – Coxy Mar 03 '10 at 01:29
  • 8
    @silky: MD5 being a dead hash cryptographically has little-to-no bearing on its use as a comparison hash for a large file. – Tanzelax Mar 03 '10 at 01:43
  • 2
    @Tanzelax: Learn to read. @coxymla: Excuse me? Exactly how is it faster to calculate an MD5 (which is done byte-by-byte, with various math operations) as opposed to *strictly* byte-by-byte. It's not possible. – Noon Silk Mar 03 '10 at 06:43
  • 1
    @silky: in theory, you'd be right. In practice, using .NET FileStream versus .NET MD5CryptoServiceProvider, getting the MD5 hashes of both files is significantly faster. – Coxy Mar 03 '10 at 07:09
  • @coxymla: I find that impossible to believe, but I'll look into it based on your insistence. – Noon Silk Mar 03 '10 at 08:29
  • @silky: sample code here - http://pastebin.com/3JMZPsfV – Coxy Mar 03 '10 at 08:40
  • 8
    @silky: The performance degradation is due to the overhead of the `ReadByte` call, whereas MD5 instead uses `Read`, reading 4k bytes each time into a `byte[]`. – dalle Mar 03 '10 at 10:33
  • @dalle: Cheers for looking into that dalle. Of course, I wouldn't advocate *reading* byte-by-byte, only comparing. (I didn't actually notice that this is what the solution posted by "astander" does; that's bad). – Noon Silk Mar 03 '10 at 10:35
  • Nice discussion. I haven't tested the speed but tried both ways: comparing byte arrays and comparing hashes. In my 9800 files there was no difference in result. – Michel Mar 03 '10 at 20:46
  • @dalle - thanks for noticing that. You're absolutely correct and I've edited the original answer. After changing to reads in 4096 byte chunks, comparing byte-by-byte is indeed faster than calculating MD5 hashes but not "significantly". – Coxy Mar 04 '10 at 00:50
9

First compare the sizes of the files. If the sizes are not the same, the files are different. If the sizes are the same, simply compare the files' contents.
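As a rough sketch of that idea, the size check can be done with `FileInfo.Length` before any content is read. The class and method names below are hypothetical, and the content comparison uses `File.ReadAllBytes` with `SequenceEqual` purely for brevity. That loads both files into memory, so for large files a streamed comparison would be the better choice.

```csharp
using System.IO;
using System.Linq;

public static class QuickFileCompare
{
    // Returns true only if both files have the same length and the same bytes.
    public static bool SameSizeAndContent(string path1, string path2)
    {
        var info1 = new FileInfo(path1);
        var info2 = new FileInfo(path2);

        // Cheap check first: files of different sizes cannot be equal,
        // so their contents never need to be read.
        if (info1.Length != info2.Length)
        {
            return false;
        }

        // Same size: fall back to comparing the actual contents.
        byte[] bytes1 = File.ReadAllBytes(path1);
        byte[] bytes2 = File.ReadAllBytes(path2);
        return bytes1.SequenceEqual(bytes2);
    }
}
```

The size check costs almost nothing compared with reading the files, which is why it is worth doing first when many of the files being compared are expected to differ.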

user88637
2

Indeed there is. Open both files, read them in as byte arrays, and compare each byte. If every byte is equal, then the files are equal.

Noon Silk