34

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.

I know that if the CRC is different, then the files are definitely different. However, if the CRC is the same, I don't know that the files are. So, I'm looking for a nice efficient way of comparing the two streams: one from the posted file and one from the file system.

I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

Palec
  • 12,743
  • 8
  • 69
  • 138
Simon Farrow
  • 1,901
  • 2
  • 20
  • 26

7 Answers7

43
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here.. if they are equal...    
    using (var file1 = new FileStream(fileName1, FileMode.Open))
        using (var file2 = new FileStream(fileName2, FileMode.Open))
            return FileStreamEquals(file1, file2);
}

static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize]; //buffer size
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}
Mehrdad Afshari
  • 414,610
  • 91
  • 852
  • 789
  • 6
    Require conunt1 == count2 could be inaccurate, as Stream.Read is free to return a block that has a length less than requested byte count. see http://msdn.microsoft.com/en-us/library/vstudio/system.io.stream.read(v=vs.100).aspx – Karata Nov 18 '14 at 08:49
  • Thanks for the solution Mehrdad. Do you need the Take calls? I tried only `if (!buffer1.SequenceEqual(buffer2))` and it seems to work. – Ozgur Ozturk Jan 08 '15 at 15:22
  • 1
    @Ozgur it works but it is less efficient and not very principled IMO. – Mehrdad Afshari Jan 08 '15 at 21:27
  • I agree with to comment by @Karata. You should make sure that the buffer is filled as much as possible by either repeatedly calling Stream.Read(), or use StreamReader.ReadBlock(). – Palec Feb 18 '16 at 13:55
  • @Palec I agree. For a FileStream that's generally not a problem though. I will rename the function to reflect that. – Mehrdad Afshari Feb 18 '16 at 22:03
  • 1
    [The docs](https://msdn.microsoft.com/en-us/library/system.io.filestream.read%28v=vs.110%29.aspx) say it may be a problem even for FileStream. Do you mean it is *usually* not a problem? Or are the docs misleading? – Palec Feb 18 '16 at 22:53
  • @Palec Well, I stand corrected then--if the docs say that for FileStream, that's the contract, even if it doesn't fail for local files. Will update the code. – Mehrdad Afshari Feb 19 '16 at 21:29
  • 2
    @Karata is absolutely right, and this faulty code should not stand uncorrected (somebody may use it on a piece of software I use later). At the very least `FileStreamEquals` should take two properly typed `FileStream` arguments; a *weak* case can *probably* be made that *usually* a `Read` request for n bytes from a file indeed reads n bytes if nothing went wrong. But would you bet your life (or your company) on every contingency? What about network mapped drives? Named pipes? – Peter - Reinstate Monica Apr 27 '16 at 08:12
  • Comment in February: *"Will update the code"* It's now September. When do you plan to fix the error in the above? – T.J. Crowder Sep 10 '16 at 14:18
  • As other said the (count1 != count2) is wrong. it works for FileStream in many devices but its so deceptive and dangerous. I recommended down vote – Mohammad Nikravan Feb 22 '20 at 00:41
  • @T.J.Crowder I'm still waiting – dafie Jun 14 '22 at 09:45
22

I sped up the "memcmp" by using a Int64 compare in a loop over the read stream chunks. This reduced time to about 1/4.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Lars
  • 657
  • 8
  • 19
  • Is this advantageous only for 64-bit CPUs or will this help on 32-bit CPUs as well? – Pretzel Jan 24 '13 at 15:31
  • It should not matter if you have a 32-bit or 64-bit operating system running. But I never tried it on a pure 32-bit CPU. You have to try it and maybe just change Int64 to int32. But aren't most of more or less modern CPUs capable of 64-bit operations (x86 since 2004)? Go ahead and try it! – Lars Jan 25 '13 at 08:56
  • See comments on [this answer](http://stackoverflow.com/a/968980/157247). Relying on `count1` equalling `count2` is not reliable. – T.J. Crowder Sep 10 '16 at 14:14
  • Does this handle the last 0-7 bytes correctly? Did you test it on two files where `fileSize % sizeof(Int64) > 0` and only the last byte is different? – Dan Bechard May 02 '17 at 13:40
9

This is how I would do it if you didn't want to rely on crc:

    /// <summary>
    /// Binary comparison of two files
    /// </summary>
    /// <param name="fileName1">the file to compare</param>
    /// <param name="fileName2">the other file to compare</param>
    /// <returns>a value indicateing weather the file are identical</returns>
    public static bool CompareFiles(string fileName1, string fileName2)
    {
        FileInfo info1 = new FileInfo(fileName1);
        FileInfo info2 = new FileInfo(fileName2);
        bool same = info1.Length == info2.Length;
        if (same)
        {
            using (FileStream fs1 = info1.OpenRead())
            using (FileStream fs2 = info2.OpenRead())
            using (BufferedStream bs1 = new BufferedStream(fs1))
            using (BufferedStream bs2 = new BufferedStream(fs2))
            {
                for (long i = 0; i < info1.Length; i++)
                {
                    if (bs1.ReadByte() != bs2.ReadByte())
                    {
                        same = false;
                        break;
                    }
                }
            }
        }

        return same;
    }
wpp
  • 7,093
  • 4
  • 33
  • 65
JonPen
  • 87
  • 1
  • 1
6

The accepted answer had an error that was pointed out, but never corrected: stream read calls are not guaranteed to return all bytes requested.

BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.

The following code takes advantage of BinaryReader to do the comparison:

    static private bool FileEquals(string file1, string file2)
    {
        using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BinaryReader b1 = new BinaryReader(s1))
        using (BinaryReader b2 = new BinaryReader(s2))
        {
            while (true)
            {
                byte[] data1 = b1.ReadBytes(64 * 1024);
                byte[] data2 = b2.ReadBytes(64 * 1024);
                if (data1.Length != data2.Length)
                    return false;
                if (data1.Length == 0)
                    return true;
                if (!data1.SequenceEqual(data2))
                    return false;
            }
        }
    }
Larry
  • 297
  • 4
  • 6
  • But, the downside of this is that it allocates both files in memory. Much more efficient is using multiple `Read` on `FileStream` to get missing bytes – dafie Jun 14 '22 at 10:01
  • @dafie No, it does not read entire files in memory as you seem to suggest. Yes there is some buffering but I think you'll find the code is very efficient, reading both files sequentially in 64k chunks. – Larry Jun 15 '22 at 18:59
  • @I've done some benchmarking: https://pastebin.com/raw/ky9D8ynd – dafie Jun 15 '22 at 20:55
  • @dafie That is useful. My post was not intended to show an absolutely optimal method, just a simple one that averted the serious bug in a previous posting. If I interpret your post correctly, the ForceRead approach (which was not posted previously) is a bit more efficient. – Larry Jun 16 '22 at 21:19
3

if you change that crc to a sha1 signature the chances of it being different but with the same signature are astronomicly small

albertjan
  • 7,739
  • 6
  • 44
  • 74
  • You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys! – Mehrdad Afshari Jun 09 '09 at 09:14
  • 1
    unfortunately you can guarantee that the one time it messes up will be absolutely critical, probably that one big pitch. – Simon Farrow Jun 09 '09 at 09:17
  • @Simon - hehe very true. @Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure. – albertjan Jun 09 '09 at 09:22
  • Take the CRC and say file size and the changes are ever smaller. – kenny Apr 14 '10 at 12:34
  • 2
    @MehrdadAfshari a rather serious app like git relies on exactly this. To quote Linus Torvalds we will "quite likely never ever see it in [collision of two files by comparing sha's] the full history of the universe". Cf. http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob. – Maate May 08 '13 at 17:59
  • @Maate Git does not rely on SHA1s to verify single file equality (which is a much more likely scenario) though, but yeah, sometimes it makes sense. – Mehrdad Afshari May 08 '13 at 18:25
2

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32 bit PC, read 4 bytes at a time and compare them as int32's. On a 64 bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You also would probably wanna use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.

Josh
  • 68,005
  • 14
  • 144
  • 156
0

I took the previous answers, and added the logic from the source code of BinaryReader.ReadBytes to get a solution that does not recreate buffer in every loop and does not suffer from unexpected return values from FileStream.Read:

public static bool AreSame(string path1, string path2) {
    int BUFFER_SIZE = 64 * 1024;
    byte[] buffer1 = new byte[BUFFER_SIZE];
    byte[] buffer2 = new byte[BUFFER_SIZE];

    int ReadBytes(FileStream fs, byte[] buffer) {
        int totalBytes = 0;
        int count = buffer.Length;
        while (count > 0) {
            int readBytes = fs.Read(buffer, totalBytes, count);
            if (readBytes == 0)
                break;

            totalBytes += readBytes;
            count -= readBytes;
        }

        return totalBytes;
    }

    using (FileStream fs1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (FileStream fs2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read)) {
        while (true) {
            int count1 = ReadBytes(fs1, buffer1);
            int count2 = ReadBytes(fs2, buffer2);

            if (count1 != count2)
                return false;

            if (count1 == 0)
                return true;

            if (count1 == BUFFER_SIZE) {
                if (!buffer1.SequenceEqual(buffer2))
                    return false;
            } else {
                if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
                    return false;
            }
        }
    }
}
Łukasz Nojek
  • 1,511
  • 11
  • 17