I came across this code, http://support.microsoft.com/kb/320348, which made me wonder what the best way is to compare two files to find out whether they differ.

The main idea is to optimize my program, which needs to check whether each file is unchanged in order to build a list of changed files and/or files to delete or create.

Currently I compare the sizes of the files and, if they match, I compute an MD5 checksum of each of the two files. After looking at the code linked at the beginning of this question, I wonder whether it is really worth using that approach instead of computing a checksum of both files (which basically requires reading all the bytes anyway)?

Also, what other checks should I make to reduce the work of checking each file?

Prix
  • I think this depends heavily on what you mean by equal. Do you want to test the equality of the text in the file or the actual bytes? What is the content of the file? Are spaces important (text equality)? The MD5 checksum would find a diff between 2 spaces and 1 space at the end of a line, where a simple text compare might not. – linuxuser27 Dec 14 '10 at 00:55
  • MD5 requires reading both files in full and then computing the hash, which can be time-consuming for large files. – Aliostad Dec 14 '10 at 00:57
  • That is why I asked the question, and Aliostad and Anon made the points I wanted to know. – Prix Dec 14 '10 at 02:17
  • Related/duplicate: http://stackoverflow.com/q/1358510/161052 – JYelton May 19 '11 at 18:03

2 Answers

Read both files in small chunks (4 KB or 8 KB buffers), which is efficient for reading, and then compare the buffers in memory (byte by byte), which is efficient for comparing.

This performs well in all cases (whether the difference is at the start, the middle or the end), because the comparison can stop at the first mismatch.

Of course, the first step is to check whether the file lengths differ; if they do, the files are certainly different.
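
The approach described above can be sketched as follows. This is a minimal illustration in Python rather than the C# of the linked article; the function name `files_equal` and the 8 KB default for `chunk_size` are assumptions chosen for the example, not part of the original answer.

```python
import os

def files_equal(path_a, path_b, chunk_size=8192):
    """Return True if both files contain exactly the same bytes."""
    # Cheap check first: files of different length cannot be equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first differing block
            if not chunk_a:
                return True   # both files reached the end together
```

Because the files are opened in binary mode, this works the same way for text and binary content. Python's standard library offers a comparable check as `filecmp.cmp(a, b, shallow=False)`.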

Aliostad
  • +1 thanks, I am already checking the size of the files before going into the checksum. But now I will be sure to change it from a checksum to a stream with a small buffer. At any rate, both will detect whether the files are equal or not, right? Whether the file is binary, text or w/e... – Prix Dec 14 '10 at 02:15

If you haven't already computed hashes of the files, then you might as well do a proper comparison (instead of looking at hashes), because if the files are the same it's the same amount of work, but if they're different you can stop much earlier.

Of course, reading and comparing a single byte at a time is probably a bit wasteful; it is a better idea to read whole blocks at a time and compare them.
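
The other side of that condition is the case where a hash has already been computed, for example one stored from an earlier run. A rough sketch in Python, assuming such a stored digest exists: the helper names `file_digest` and `changed_since` are hypothetical, and MD5 is used only because the question mentions it.

```python
import hashlib

def file_digest(path, chunk_size=8192):
    """Hash a file block by block so it never has to fit in memory."""
    h = hashlib.md5()  # any hashlib algorithm could be used here
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_since(path, stored_digest):
    """Hypothetical helper: compare a file against a digest saved earlier."""
    return file_digest(path) != stored_digest
```

Here only one file has to be read per check, which is where a hash can pay off. When both files must be read anyway, the direct block comparison still has the advantage of stopping early.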

Anon.