23

TL;DR: I have a CMS that stores attachments (opaque files) using the SHA-1 of the file contents as the filename. How do I verify that an uploaded file really matches the one in storage, given that I already know the SHA-1 hashes of both files match? I'd like high performance.

Long version:

When a user uploads a new file to the system, I compute the SHA-1 hash of the uploaded file contents and then check whether a file with an identical hash already exists in the storage backend. PHP puts the uploaded file in /tmp before my code gets to run, and I then run sha1sum against the uploaded file to get the SHA-1 hash of its contents. I then compute a fanout from the SHA-1 hash and decide the storage directory under an NFS-mounted directory hierarchy. (For example, if the SHA-1 hash of a file's contents is 37aefc1e145992f2cc16fabadcfe23eede5fb094, the permanent file name is /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094.) In addition to saving the actual file contents, I INSERT a new row into an SQL database for the user-submitted metadata (e.g. Content-Type, original filename, datestamp, etc.).
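The hashing and fanout steps above can be sketched in shell (the file path and contents here are illustrative stand-ins, not the actual application code):

```shell
# Sketch of the fanout scheme described above (POSIX shell + coreutils).
# The example file is created here only to keep the snippet self-contained.
file=/tmp/example-upload.bin
printf 'example contents' > "$file"

# sha1sum prints "<40-hex-digit hash>  <filename>"; keep only the hash.
sha1=$(sha1sum "$file" | cut -d' ' -f1)

# Split the hash as 2 + 2 + 36 hex digits to build the permanent path.
d1=$(printf '%s' "$sha1" | cut -c1-2)
d2=$(printf '%s' "$sha1" | cut -c3-4)
rest=$(printf '%s' "$sha1" | cut -c5-)
dest="/nfs/data/files/$d1/$d2/$rest"
echo "$dest"
```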

The corner case I'm currently figuring out is where a newly uploaded file has a SHA-1 hash that matches an existing hash in the storage backend. I know that the chances of this happening by accident are astronomically low, but I'd like to be sure. (For the deliberate-collision case, see https://shattered.io/.)

Given two filenames $file_a and $file_b, how can I quickly check whether both files have identical contents? Assume the files are too big to be loaded into memory. With Python I'd use filecmp.cmp(), but PHP does not seem to have anything similar. I know this can be done with fread() and aborting as soon as a non-matching byte is found, but I'd rather not write that code myself.

Mikko Rantalainen
  • Are you trying to hedge against hash collisions? – jlew Sep 17 '13 at 12:34
  • Using a hash is a good idea. As you've mentioned, the probability of a collision is astronomically low - so you can be sure it will be OK in the common case. If not - let us know your case with the content of those files :p – Alma Do Sep 17 '13 at 12:34
  • git is using sha1 so I think you're safe enough to use sha1 :) – Kakawait Sep 17 '13 at 12:36
  • I'm trying to avoid possibly losing the file contents because of a hash collision. And yes, if I ever see a collision, I'll keep both files. I would bet that in that case I will find that my permanent storage has bitrotted. (The chances of getting a random bit error on any storage device seem much higher than finding a SHA-1 collision; I'd still like to have a fresh copy of the corrupted file in this case.) – Mikko Rantalainen Sep 17 '13 at 12:42
  • @Kakawait: `git` also does compare-by-bytes test before trusting that the file is identical just because SHA-1 hash happens to match, as far as I know. – Mikko Rantalainen Sep 17 '13 at 12:43
  • Thanks @Mikko Rantalainen for this information. I didn't know – Kakawait Sep 17 '13 at 12:44
  • Maybe you can use another hashing function to check whether it produces the same result. – invisal Sep 17 '13 at 12:48

7 Answers

31

If you already have one SHA-1 sum, you can simply do:

if ($known_sha1 === sha1_file($new_file))

otherwise:

if (filesize($file_a) === filesize($file_b)
    && md5_file($file_a) === md5_file($file_b)
)

The file size is checked too, to further reduce the chance of a hash collision (which is already very unlikely). MD5 is used because it's significantly faster than the SHA algorithms (but a little more collision-prone). Use sha1_file() if you want an even smaller chance of a collision.


This is how to compare two files against each other exactly, byte by byte.
It will run considerably slower than a native hash function.

function compareFiles($file_a, $file_b)
{
    // Files of different size can never have identical contents
    if (filesize($file_a) != filesize($file_b))
        return false;

    $chunksize = 4096;

    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');

    if ($fp_a === false || $fp_b === false)
    {
        if ($fp_a !== false) fclose($fp_a);
        if ($fp_b !== false) fclose($fp_b);
        return false; // could not open one of the files
    }

    try
    {
        while (!feof($fp_a) && !feof($fp_b))
        {
            $d_a = fread($fp_a, $chunksize);
            $d_b = fread($fp_b, $chunksize);
            if ($d_a === false || $d_b === false || $d_a !== $d_b)
                return false;
        }

        return true;
    }
    finally // requires PHP 5.5+
    {
        fclose($fp_a);
        fclose($fp_b);
    }
}
Cobra_Fast
  • The difference between MD5 and SHA-1 is easily dwarfed by the IO required to actually get the bits from the storage. The permanent file storage is mounted with NFS using 1Gbps connection, which is obviously the bottleneck for hashing the whole file. – Mikko Rantalainen Sep 18 '13 at 05:30
  • I'm already checking the file hashes (SHA-1). The corner case I'm trying to figure out is verifying that all the bytes match if the SHA-1 hashes match and the file size is identical. I know that the chances of this happening are really low, but the code required to avoid even that low chance is not that hard to write. – Mikko Rantalainen Sep 18 '13 at 06:33
  • 1
    @MikkoRantalainen I've added code to my answer that exactly compares the two files. – Cobra_Fast Sep 18 '13 at 10:37
  • 1
    You're missing two `fclose()` calls and the code would look better if you return immediately after failed `filesize()` test. It's a shame that PHP does not provide such functionality by default. – Mikko Rantalainen Sep 18 '13 at 12:26
  • What about memory and CPU issues? Think about running this in a loop over several thousand files. Do you think there will be memory overhead? We know that only two files are being processed on each iteration, and 4096 * 2 bytes are consumed per comparison. But what about CPU time? I tested this function in a loop of 6000 comparisons. Eight minutes after invoking the script I killed the process, because I didn't know how much longer it would run. On the other hand, the simpler expression `sha1_file($file_a) == sha1_file($file_b)` performed much better. – hswner Jun 27 '14 at 06:46
  • @hswner If you want to run my code for several thousand files, then PHP probably is already the wrong choice. You'd be much better off implementing it in C or C++, which will run about 40 times more CPU-efficiently (at least in my own experience). – Cobra_Fast Jun 27 '14 at 11:07
  • @Cobra_Fast There's no problem with your code. In fact it's how it must be. But hey, why do you take it personally? We're discussing PHP and considering the usual case, where one might be working on a shared host with no chance to hack up some C/C++. – hswner Jun 27 '14 at 13:02
  • 1
    What would be the best practice for, say, 500 image files of 1-10 MB each? SHA-1, MD5 or the direct compare? Which performs best? – Karl Adler Sep 01 '15 at 15:31
  • 2
    `fread($fp_a, 4096)` returns empty string `""` at EOF. So this loop is infinite. You should add `while (!feof($fp_a) && ($b = fread($fp_a, 4096)) !== false)` – Collector Jul 14 '17 at 07:51
  • @KarlAdler: if you have e.g. 500 files that you don't know hash or the contents and want to find duplicates, first compare stat()s of those files. If file sizes differ, you don't need to compare contents. If you have only 1-2 possible options for duplicates (that is, identical file size) doing direct file compare using above code is best option. If you have more possible matches (at least 3 files with identical size), doing hashes first to find obvious non-duplicates should reduce total I/O required. If you know that headers of different files will differ, use above code in all cases. – Mikko Rantalainen Jan 28 '21 at 09:54
  • This never worked for me in 2021, but I knew the idea was correct. The answer here is a working version: https://stackoverflow.com/a/3060247/1642731 – JSG Jun 11 '21 at 07:22
8

Update

If you want to make sure that the files are equal, you should first check the file sizes and, if they match, simply diff the file contents. This is much faster than using a hash function and will definitely give the correct result.
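A minimal sketch of that approach, assuming a POSIX system where shelling out is acceptable: `cmp -s` exits with status 0 exactly when the two files are byte-identical, and stops at the first differing byte.

```shell
# Sketch: cheap size check first, then cmp(1) for an exact byte comparison.
# The sample files are created here only to keep the snippet self-contained.
file_a=/tmp/upload_a
file_b=/tmp/upload_b
printf 'same contents' > "$file_a"
printf 'same contents' > "$file_b"

if [ "$(wc -c < "$file_a")" -eq "$(wc -c < "$file_b")" ] && cmp -s "$file_a" "$file_b"
then
    verdict=identical
else
    verdict=different
fi
echo "$verdict"
```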


You do not need to load the whole file content into memory when hashing it with md5_file(), sha1_file() or another hash function. Here is an example using MD5:

$hash = md5_file('big.file'); // big.file is 1 GB in my test
var_dump(memory_get_peak_usage());

Output:

int(330540)

In your example it would be:

if(md5_file('FILEA') === md5_file('FILEB')) {
    echo 'files are equal';
}

Note further that whenever you use a hash function, you have to trade off complexity on the one hand against the probability of collisions (two different messages producing the same hash) on the other.

hek2mgl
  • @hek2mgl: thanks, I didn't know that the PHP implementation was sane enough not to read the whole file into memory. I don't need to use `shell_exec()` and `sha1sum` anymore to handle big files. – Mikko Rantalainen Sep 17 '13 at 12:47
  • Yeah, they are often forgotten :) .. Also have a look at other, maybe faster, hash functions. But those would have to be called using `shell_exec()` again – hek2mgl Sep 17 '13 at 12:50
  • I wouldn't claim that `files are equal` in case md5 hash matches. I would claim that `files are probably equal` which is the case I already can claim when SHA-1 hashes match. – Mikko Rantalainen Sep 19 '13 at 07:15
  • @MikkoRantalainen If you want to make sure that they are equal, hash functions don't suit at all. Use `diff` .. it is faster and **can** answer the question – hek2mgl Sep 19 '13 at 08:21
  • @hek2mgl hashing is very smart as a first step because the situation is that I have 2e6 files in permanent storage and I receive a new one. I have a list of existing SHA-1 for each stored file so I first compute SHA-1 for the new file. Any match with stored SHA-1 should be considered as candidate match, not a real match. – Mikko Rantalainen Sep 19 '13 at 12:38
3

When your files are big and binary, you can just test a few bytes at a few offsets. This should be much faster than any hashing function, especially since the function returns as soon as it finds the first differing byte.

However, this method can miss files that differ in only a few bytes. It's best suited to big archives, videos and so on.

function areFilesEqual($filename1, $filename2, $accuracy)
{

    $filesize1 = filesize($filename1);
    $filesize2 = filesize($filename2);

    if ($filesize1 === $filesize2) {

        // Open in binary mode so Windows line-ending translation cannot interfere
        $file1 = fopen($filename1, 'rb');
        $file2 = fopen($filename2, 'rb');

        for ($i = 0; $i < $filesize1; $i += $accuracy) {
            fseek($file1, $i);
            fseek($file2, $i);
            if (fgetc($file1) !== fgetc($file2)) {
                fclose($file1);
                fclose($file2);
                return false;
            }
        }

        fclose($file1);
        fclose($file2);

        return true;
    }

    return false;
}
sliwhas
2

Use the SHA-1 hash, just like you do now. If the hashes are equal, also compare the MD5 hashes and the file sizes. If you THEN encounter a file that matches on all three checks but is NOT equal - you just found the holy grail :D
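A rough sketch of that triple check, using the coreutils hash tools rather than PHP (file paths and contents are illustrative):

```shell
# Sketch: treat files as equal only if SHA-1, MD5 and size all match.
a=/tmp/candidate_a
b=/tmp/candidate_b
printf 'payload' > "$a"
printf 'payload' > "$b"

verdict=differ
if [ "$(sha1sum < "$a")" = "$(sha1sum < "$b")" ] \
   && [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] \
   && [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ]
then
    verdict=match   # all three checks agree
fi
echo "$verdict"
```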

dognose
  • I do one SHA-1 pass already to avoid comparing against all the files in the permanent storage. Doing another hash would get me nowhere, because SHA-1 is already a pretty good hash and the only way to get obviously better results is to compare the actual bytes. Any other hash requires re-reading the whole file from storage, and at that point it makes more sense to compare bytes, because if I find a difference I can stop in the middle of the file, unlike with another hash function. – Mikko Rantalainen Sep 18 '13 at 06:37
1

So I came across this, then found a question that answers it and really works.

It's 2021 and things change, so I figure I will post a link to that answer here.

A) Basically it uses fopen and fread as shown above, but it works. The accepted answer kept returning "different" for me, even on the same file.

B) The fopen/fread method will be faster than the sha1 or md5 methods if you can use it, and I don't see why you couldn't.

Svish's version from the link above:

function files_are_equal($a, $b)
{
  // Files of different size can never be equal
  if(filesize($a) !== filesize($b))
      return false;

  // Compare content chunk by chunk, in binary mode
  $ah = fopen($a, 'rb');
  $bh = fopen($b, 'rb');

  $result = true;
  while(!feof($ah))
  {
    // Strict comparison avoids PHP's loose type juggling on the chunks
    if(fread($ah, 8192) !== fread($bh, 8192))
    {
      $result = false;
      break;
    }
  }

  fclose($ah);
  fclose($bh);

  return $result;
}
JSG
0

You can use the turbodepot library. It is pure PHP and handles this with a single line of code:

require 'path/to/your/dependencies/folder/turbocommons-php-X.X.X.phar';
require 'path/to/your/dependencies/folder/turbodepot-php-X.X.X.phar';

use org\turbodepot\src\main\php\managers\FilesManager;

$filesManager = new FilesManager();
$filesManager->isFileEqualTo('path/to/file1', 'path/to/file2');

You can see the code here; it compares first by size and then by chunks of data:

https://github.com/edertone/TurboDepot/blob/f74a12ac330ec49604403a2f60502ced591c6da8/TurboDepot-Php/src/main/php/managers/FilesManager.php#L129

This library also gives you many more file-system features, such as comparing two folders, searching within folders, mirroring folders and more.

More info here:

https://turboframework.org/en/blog/2020-11-03/check-if-two-files-are-identical-using-javascript-typescript-php

Jaume Mussons Abad
  • -1 this is an incorrect answer and the accepted answer already mentions this alternative (MD5 checksum) + provides the correct answer. The library you linked to literally doesn't compare file contents but filesize + MD5 sum, yet still requires the full IO load for the file, so you get all the performance hit without a correct result. See here for an example of how easy it is to create collisions with MD5 and you'll understand why this is a bad idea: https://stackoverflow.com/q/933497/334451 – Mikko Rantalainen Dec 24 '21 at 09:40
  • Totally right! The library code has been updated to use the correct method, all unit tests pass exactly the same as using the previous hash method so this is better of course. A new release of the library (7.0.2) has been generated and this answer updated pointing to the new code. Could you please reconsider your comment now? – Jaume Mussons Abad Dec 25 '21 at 09:29
  • It's nice to see that the library has been fixed! I removed the negative vote but the comment cannot be modified anymore (comments in SO can be modified only for 5 minutes). – Mikko Rantalainen Dec 27 '21 at 17:41
  • Great, much appreciated! – Jaume Mussons Abad Dec 27 '21 at 20:03
-1

The following piece of code checks whether two files are identical:

/*** check equality of files ***/
$file1 = "pics/star.jpg";
$file2 = "pics/dupe.jpg";

if (sha1_file($file1) === sha1_file($file2))
    echo "Identical";
else
    echo "Not Identical";
SwR
  • @Spooky: OK. The code I posted is suitable for files of only a few bytes. – SwR Oct 29 '13 at 04:44
  • The question already said "given that I already know that SHA-1 hash matches for both files", so it's a pretty safe assumption that I know how to compute the SHA-1 hash (or "checksum"). I also know that the files *may not* be identical despite the fact that the SHA-1 hashes match (see http://stackoverflow.com/questions/2479348/is-it-possible-to-get-identical-sha1-hash). – Mikko Rantalainen Oct 29 '13 at 08:18