29

I want to synchronize two directories, and I use

file_get_contents($source) === file_get_contents($dest)

to compare two files. Is there any problem with doing this?

xdazzyy
  • If the files go through a process on your server, maybe that process can generate a hash each time a file is added or modified; that way you only need to compare the hashes to know what to sync – Khaled.K Apr 09 '19 at 13:09
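A minimal sketch of what that comment suggests, assuming you can hook into whatever process writes the files; the manifest layout and both function names are hypothetical:

// Manifest: JSON map of path => sha1, updated whenever a file is written
function update_manifest($manifestFile, $path)
{
    $manifest = file_exists($manifestFile)
        ? json_decode(file_get_contents($manifestFile), true)
        : [];
    $manifest[$path] = sha1_file($path);
    file_put_contents($manifestFile, json_encode($manifest));
}

// At sync time, compare stored hashes instead of re-reading the files
function needs_sync(array $sourceManifest, array $destManifest, $path)
{
    return ($sourceManifest[$path] ?? null) !== ($destManifest[$path] ?? null);
}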

8 Answers

35

I would rather do something like this:

function files_are_equal($a, $b)
{
    // Different sizes means the files cannot be equal
    if (filesize($a) !== filesize($b))
        return false;

    // Compare the contents in 8 KiB chunks; stop at the first difference
    $ah = fopen($a, 'rb');
    $bh = fopen($b, 'rb');

    $result = true;
    while (!feof($ah))
    {
        // Strict comparison avoids PHP's numeric-string coercion with ==
        if (fread($ah, 8192) !== fread($bh, 8192))
        {
            $result = false;
            break;
        }
    }

    fclose($ah);
    fclose($bh);

    return $result;
}

This first checks whether the file sizes match; only if they do does it step through the files chunk by chunk.

  • Checking the modified time can be quick in some cases, but it only tells you that the files were modified at different times; they might still have the same content.
  • Using sha1 or md5 might be a good idea, but it requires reading each whole file to compute the hash. If the hash is something that could be stored and reused later, then it's probably a different story.
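For the directory-sync case in the question, a minimal sketch of how this function might be used ($source and $dest are hypothetical paths, not from the answer):

$source = '/path/to/source/file.txt';
$dest   = '/path/to/dest/file.txt';

// Copy only when the destination is missing or actually differs
if (!file_exists($dest) || !files_are_equal($source, $dest)) {
    copy($source, $dest);
}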
Svish
    This looks to me like the all-around best way to do things. The filesize check at the beginning will catch most changes right away without even needing to read the files, and not only do you also eliminate the overhead of hashing two entire files, but it will *stop* reading as soon as it finds a difference! – Dan Hlavenka Jun 17 '14 at 19:33
  • Your code sample is bogus as it does not properly check the EOF of both files. If $bh is larger than $ah, you may get TRUE. – Alexis Wilke Jun 26 '14 at 03:58
  • @AlexisWilke: Care to elaborate? If `$bh` is larger than `$ah`, wouldn't it already have returned `false` a few lines up? – Svish Jun 26 '14 at 08:54
  • My mistake, you compare the sizes first! So the loop does not need to take that into account... – Alexis Wilke Jun 28 '14 at 23:43
  • I ran a quick benchmark and this is definitely the faster way of calculating the file difference even if the files are identical up to the last byte. – Dom Oct 31 '15 at 12:47
25

Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing the hashes will be very fast.

You can also consider other methods like comparing filemtime or filesize, but hashing will give you guaranteed results even if just one bit has changed.
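If the same files are compared repeatedly, a sketch of storing and reusing the hashes so each file is only re-read when it changes (the cache layout here is an assumption, not part of the answer):

// Cache: path => [mtime, sha1]; recompute only when the mtime changes
function cached_sha1($path, array &$cache)
{
    $mtime = filemtime($path);
    if (!isset($cache[$path]) || $cache[$path][0] !== $mtime) {
        $cache[$path] = [$mtime, sha1_file($path)];
    }
    return $cache[$path][1];
}

$cache = [];
$same = cached_sha1($source, $cache) === cached_sha1($dest, $cache);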

Tatu Ulmanen
    `sha1_file()` has to hash the whole file. Surely it's no faster than a manual comparison. – Oli Jun 17 '10 at 09:37
  • @Oli, I assume that hashing the files and comparing 40 characters to other 40 characters is faster than comparing the whole file contents to each other. – Tatu Ulmanen Jun 17 '10 at 11:42
    I'm not sure that's a fair assumption. Say you have two files, 2M chars long and the first characters are different. Hashing would read 4M chars, build two hashes, then compare 1 to 40 chars (depending on the similarity of the hashes). Direct comparison would read 2 chars and return. Extreme case but direct comparison will always read less data if the files are equal. – Oli Jun 17 '10 at 12:16
  • It depends on what you're doing with it. If you're just comparing two random files, then [my answer](https://stackoverflow.com/a/3060247/39321) is a lot more efficient. If, however, you are going to do these comparisons a lot *and* you can store that hash and reuse it later (so you only have to go through every file *once*), then hashing is likely better. – Svish Jun 26 '14 at 08:57
  • This discussion is exactly why I ended up here. I still believe it makes more sense (especially for large files) to compare the files directly: first compare size (if not the same, return false), then the type (same story), then the content bit by bit (as soon as one bit differs, return false). This instead of hashing two huge files of different size and kind, only to discover after the expensive hash calculation that they were different all along (in size, type, and even the first bit of content). – Wilt Mar 25 '15 at 10:44
  • From some quick benchmarking, the hash method is always slower than direct comparison, even if you have to compare every byte in the files. The hash method has huge overhead because of the calculations involved in the hashing. Even though sha1_file is compiled C, that still can't make up for the overhead. For 10 MB files, even when every byte has to be compared, direct comparison is about 3.5 times faster than hash comparison. – Dom Oct 31 '15 at 12:46
  • sha1_file is not faster, but it is more memory efficient. I ran a test on a 1.2 MB file: file_get_contents was 3.7 times faster, but sha1_file used only 5% of the memory that the file_get_contents comparison did – DeepBlue Nov 08 '15 at 20:16
  • The other downside to hashing is that there is the possibility of collisions. While this possibility is remote, it does exist. If you need 100% certainty about the file differences, hashing won't do it. – Nick Coons Oct 01 '20 at 19:26
  • After reading the comments, I think this answer's claim to be faster is misleading. If the author doesn't rewrite it, shouldn't this answer be downvoted? – user2553863 Dec 27 '21 at 19:28
5
  • Memory: e.g. with a 32 MB memory limit and two 20 MB files, reading both into memory will hit an unrecoverable fatal error while trying to allocate memory. This can be solved by checking the files in smaller parts.
  • Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster (and if you want to be 110% sure, you can compare the files byte-by-byte when the hashes match, but the hash comparison alone already rules out almost every case where the content differs).
  • Efficiency: do some preliminary checks - e.g. there's no point comparing two files if their sizes differ. A sketch combining these points follows below.
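A sketch combining those three points (the function name is hypothetical): the size check is nearly free, and the hash is computed incrementally by PHP, so the memory limit is respected:

function probably_equal($a, $b)
{
    // Preliminary check: different size means different files
    if (filesize($a) !== filesize($b)) {
        return false;
    }
    // sha1_file() reads the file in chunks internally, so even 20 MB
    // files fit comfortably under a 32 MB memory limit
    return sha1_file($a) === sha1_file($b);
}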
Piskvor left the building
  • +1 for the file size check. Although, if there are text files, look out for automatic line ending conversion, which can be a hassle. – Boldewyn Jun 17 '10 at 08:53
  • @Boldewyn: Good point - Windows line ends are 2 bytes, UNIX/Mac are 1 byte. However, in this case, it would mean that one of the files *has* been changed (with the exception of FTP, where all sorts of crazy things happen). – Piskvor left the building Jun 17 '10 at 09:10
  • That's exactly the problem. If one of the folders is on the other end of a pipe that does line ending conversion, all text files will always be different. – Boldewyn Jun 17 '10 at 20:10
  • If something is doing line ending conversion, then I'd rather look into fixing *that*. If two files are identical except for the line endings, they should still be considered different in my opinion. – Svish Jun 26 '14 at 09:00
  • @Svish: Sure. Unfortunately, the FTP protocol (the usual culprit of such issues) refuses to die, for mysterious reasons - even though it's insecure and chock-full of weird quirks. – Piskvor left the building Jun 26 '14 at 09:51
4

This will work, but it is inherently less efficient than calculating a checksum for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.

http://php.net/sha1_file

http://php.net/md5_file

if (sha1_file($source) == sha1_file($dest)) {
    /* ... */
}
2

Check first for the obvious:

  1. Compare size
  2. Compare file type (mime-type).
  3. Compare content.

(Add comparison of date, file name and other metadata to this obvious list if those are also supposed to match.)

When comparing content, hashing does not sound very efficient, as @Oli says in his comment. If the files are different, they will most likely already differ near the beginning. Calculating a hash of two 50 MB files and then comparing the hashes sounds like a waste of time if the second bit already differs...

Check this post on php.net. It looks very similar to that of @Svish, but it also compares the file mime-type. A smart addition if you ask me.
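A sketch of that ordering (the function name is mine; the mime-type check assumes the fileinfo extension, and the content step reuses sha1_file() for brevity):

function same_file($a, $b)
{
    // 1. Cheapest check first: size
    if (filesize($a) !== filesize($b)) {
        return false;
    }
    // 2. Mime-type, as in the php.net post mentioned above
    if (mime_content_type($a) !== mime_content_type($b)) {
        return false;
    }
    // 3. Content last; a chunked compare like @Svish's would bail out
    // earlier than hashing when files differ near the beginning
    return sha1_file($a) === sha1_file($b);
}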

Wilt
1

Seems a bit heavy. This will load both files completely as strings and then compare.

I think you might be better off opening both files manually and stepping through them in chunks, perhaps just doing a filesize check first.

Oli
1

There isn't anything wrong with what you are doing here, except that it is a little inefficient. If you get the contents of each file and compare them as whole strings, especially with larger files or binary data, you may run into problems.

I would take a look at filemtime (last modified) and filesize, and run some tests to see if that works for you. It should be all you need, at a fraction of the computation power.
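A sketch of that cheap, metadata-only check (the function name is hypothetical); per the discussion above, equal mtime and size do not guarantee equal content:

function probably_unchanged($source, $dest)
{
    // No file contents are read at all - metadata only
    return filemtime($source) === filemtime($dest)
        && filesize($source) === filesize($dest);
}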

Sam Becker
0

Something I noticed is that the other answers overlook the combinatorial factor: to compare by filesize() alone, you would have to check every file against every other file (roughly N² comparisons). Why? The first and second files may have different sizes while the third file is the same size as the first.

So first you need to get a list of all of the files you are going to work with. If you want to do the filesize approach, use the complete path/filename string as the key of an array and store the filesize() information as the value. Then sort the array so all files of the same size are grouped together. THEN you can compare file sizes. However, this does not mean the files really are the same - only that they are the same size.

You then need to do something like sha1_file() and, as above, build an array where the path/filenames are the keys and the returned hashes are the values. Sort those, and then do a simple walk through the array, comparing each stored sha1_file() value with the next. So, is A == B? Yes. Do any additional tests, then get rid of the SECOND file and continue.
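A sketch of that approach ($paths is assumed to be an array of full path strings; the variable names are mine): group by filesize() first, then hash only within groups of equal size.

// Group full paths by size so only same-sized files get hashed
$bySize = [];
foreach ($paths as $path) {
    $bySize[filesize($path)][] = $path;
}

$duplicates = [];
foreach ($bySize as $size => $group) {
    if (count($group) < 2) {
        continue; // unique size, cannot have a duplicate
    }
    // Hash within the group; identical hashes are treated as duplicates
    $byHash = [];
    foreach ($group as $path) {
        $byHash[sha1_file($path)][] = $path;
    }
    foreach ($byHash as $hash => $matches) {
        if (count($matches) > 1) {
            $duplicates[] = $matches;
        }
    }
}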

Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)

Mark Manning