1

Possible Duplicate:
What is a good way to check if an image is unique using PHP?

A user uploads an image (png, jpg, gif) via a form. I am using hash_file to check against the db to see whether the image has already been uploaded, but I am now noticing that the hash is not unique.

Is this a bug or should I be using something else to generate a unique ID for the files?

I guess the workaround would be md5(filesize($file) . $hash)?

UPDATE: From the logs... the first set uses md5_file, the second uses hash_file with sha256...

HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'd41d8cd98f00b204e9800998ecf8427e'
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'd41d8cd98f00b204e9800998ecf8427e'

HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c
20130117T231016: booru.pixymedia.us/utilities/batchExistingUpload.php
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
HASH: SELECT SiteID FROM tbl_image_hashes WHERE SiteID = 0 AND Hash = 'e3b0c44298fc1c

And no, the SQL is right... I've uploaded 3,000 files successfully with this function...

This is the hash generating code:

$fileHash = hash_file("sha256",$FILE["tmp_name"]);

$FILE is basically an entry from $_FILES; it's just what the function parameter is named.

Community
allanx2000

3 Answers

5

d41d...427e and e3b0...b855 are the MD5 and SHA256 sums of the empty string (i.e., md5("") and hash("sha256", "")). The fact that you've got these in your database indicates that there is something wrong with your code -- you may be hashing the wrong filename, or an empty file, at some point.
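This is easy to confirm directly. The sketch below compares the two digests copied from the question's log against the digests of zero bytes of input; `isEmptyInputDigest` is a hypothetical helper name, not something from the original code.

```php
<?php
// Digests copied verbatim from the log output in the question.
$loggedMd5    = 'd41d8cd98f00b204e9800998ecf8427e';
$loggedSha256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';

// Hypothetical helper: true when $digest is what $algo produces
// for zero bytes of input.
function isEmptyInputDigest(string $algo, string $digest): bool
{
    return hash_equals(hash($algo, ''), strtolower($digest));
}

var_dump(md5('') === $loggedMd5);                       // bool(true)
var_dump(isEmptyInputDigest('sha256', $loggedSha256));  // bool(true)
```

A check like this could also be run against the rows already stored in the hash table to find entries that were created from empty input.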

1

The problem with working with the image data is that the same image can be represented in many ways. This is especially true of GIFs, where the colour table can be in any order and the result is the same.

You should probably work out a way to hash the image itself. You could do this by reading the colour of each pixel and generating some kind of hash from that. Alternatively, you could try using GD to load the image, and then let it "normalize" it by having it output the image with imagegd(), and then using that to check uniqueness.
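A minimal sketch of the per-pixel idea, assuming the GD extension is available. `pixelHash` is a hypothetical name; it hashes the decoded dimensions and RGBA values, so two files that decode to identical pixels get the same digest even if their bytes differ. It is not a similarity measure: any lossy re-encoding changes the pixels and therefore the hash, and for an animated GIF GD only decodes the first frame.

```php
<?php
// Hypothetical helper: hash decoded pixels rather than raw file bytes.
// Returns null when the file is missing, empty, or not decodable by GD.
function pixelHash(string $path): ?string
{
    $bytes = @file_get_contents($path);
    if ($bytes === false || $bytes === '') {
        return null;
    }

    $img = @imagecreatefromstring($bytes);
    if ($img === false) {
        return null;
    }

    $w = imagesx($img);
    $h = imagesy($img);

    $ctx = hash_init('sha256');
    // Include the dimensions so a 2x8 and a 4x4 image with the same
    // pixel sequence do not collide.
    hash_update($ctx, pack('NN', $w, $h));

    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            // imagecolorsforindex() resolves both palette indexes and
            // truecolor values to red/green/blue/alpha components.
            $c = imagecolorsforindex($img, imagecolorat($img, $x, $y));
            hash_update($ctx, pack('C4', $c['red'], $c['green'], $c['blue'], $c['alpha']));
        }
    }

    return hash_final($ctx);
}
```

Walking every pixel in PHP is slow for large images, so this is probably only worth doing when a plain hash_file comparison has already failed to find a match.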

Niet the Dark Absol
  • Although as soon as any sort of lossy conversion (including colorspace or image compression) or altering conversion (watermark, etc.) is in effect, it doesn't matter *where* the hash input comes from (so I doubt that working with the restored image itself to build the hash will lead to any noticeable benefit) .. trying to find similar "probably the same" images is a much more difficult task .. –  Jan 18 '13 at 04:05
  • Well I don't need it to be that precise, it's more costly if an image is not a duplicate than actually removing the duplicate. For some reason md5_file was giving the same hash for completely different images... – allanx2000 Jan 18 '13 at 04:08
0

If you are getting the same hash value for different files, then consider the possibilities:

  1. The hash is being generated incorrectly (=> check the input! <=); or,
  2. The hash used is not of sufficient quality (SHA-x is sufficient); or,
  3. The hash implementation is broken (doubtful, and not the case here); or,
  4. The files really do have the same content (determined to be false)

The odds of accidental SHA-x collisions are extremely small; here is a probability table, which doesn't do justice to exactly how unlikely this is. This article on 160-bit hashes has a more comparable scale at the bottom: there is a higher chance of being hit by a meteor!

In any case, #1 is indeed the culprit.

Hint: hash("sha256", "")
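Checking the input before hashing could look roughly like the following. This is a sketch under assumptions: `hashUploadOrNull` is a hypothetical name, and the array is assumed to have the usual `error` and `tmp_name` keys of a single `$_FILES` entry. In a real upload handler, `is_uploaded_file()` would be a sensible additional check; it is left out here so the function can also be exercised from the command line.

```php
<?php
// Hypothetical helper: returns a SHA-256 hex digest, or null when the
// upload entry does not point at a readable, non-empty file.
function hashUploadOrNull(array $file): ?string
{
    // Reject anything PHP itself flagged as a failed upload.
    if (($file['error'] ?? UPLOAD_ERR_NO_FILE) !== UPLOAD_ERR_OK) {
        return null;
    }

    $path = $file['tmp_name'] ?? '';
    if (!is_string($path) || $path === '' || !is_file($path)) {
        return null;
    }

    // A zero-byte file would hash to the "empty input" digest.
    if (filesize($path) === 0) {
        return null;
    }

    $digest = hash_file('sha256', $path);
    return $digest === false ? null : $digest;
}
```

The caller can then skip the database lookup and log the upload error whenever null comes back, instead of storing a digest of nothing.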

Community