
I run a small image hosting service and have realized that it contains a lot of duplicate content. I want to prevent this in the future by using a checksum or hash: each newly uploaded file would be hashed and compared against a database of existing image hashes. If the image already exists, the new file would be deleted and the user would be given the link to the existing image, all in one pass.

My setup is bare-bones: Node.js + jQuery File Upload + 2 directories (one for forum uploads, the other for direct web uploads).

What is the best (fast and reliable) hash and database setup for this, given that there might be thousands or millions of files in each directory? I think MD5 or SHA-1 is overkill and might take a lot of resources. I would like to know if there is a simpler solution.

Statistics:
~1,000 images uploaded every day
~400 KB average image size
~35,000 images on the server
~30% duplicate content (tested using MD5)

Mohd Nor

2 Answers


MD5 is actually quite fast, more than fast enough for your use case. One anecdotal benchmark has it at about 400 megabytes per second on a single CPU (source). At the ~1,000 uploads of ~400 KB each per day that you describe, that is roughly 400 MB, or about one second of hashing per day. It wouldn't be the bottleneck in your server processing, and it is a reliable way to check for duplicate files. MD5 is vulnerable to collision attacks, but those must be painstakingly prepared; chance collisions are astronomically unlikely. It sounds like collisions wouldn't be much of a problem in your application (but make sure you handle them anyway).
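As a minimal sketch of the hash-and-compare step, assuming Node's built-in `crypto` module: the `imagesByHash` map, the `registerUpload` function and the record shape are hypothetical names for illustration, and a real setup would keep the index in a database and read the stored bytes from disk rather than holding them in memory.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory index: MD5 hex digest -> { url, data }.
// In practice this would be a database table keyed on the digest.
const imagesByHash = new Map();

// Return the MD5 digest of a Buffer as a 32-character hex string.
function md5Hex(buffer) {
  return crypto.createHash('md5').update(buffer).digest('hex');
}

// Register an upload; return the existing URL if the same bytes are already stored.
function registerUpload(buffer, url) {
  const hash = md5Hex(buffer);
  const existing = imagesByHash.get(hash);

  if (existing) {
    // Handle collisions anyway: confirm the bytes really match.
    if (existing.data.equals(buffer)) {
      return { url: existing.url, duplicate: true };
    }
    throw new Error('MD5 collision between different files: ' + hash);
  }

  imagesByHash.set(hash, { url, data: buffer });
  return { url, duplicate: false };
}
```

A second upload with identical bytes then gets back the first upload's URL with `duplicate: true`, and the new copy can be discarded.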

If you truly just want speed to the exclusion of reliability, you could go with a CRC. It's not intended to be a true hash, just to detect errors in a byte stream, so it has a relatively high collision rate: a 32-bit CRC has only about 4.3 billion possible values, which makes a chance collision a realistic possibility once you store tens of thousands of files. However, it's blazing fast; it's meant to be implemented in hardware on routers.
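To make the trade-off concrete, here is a rough sketch of a table-driven CRC-32 (the common reflected polynomial `0xEDB88320`) together with a birthday-bound estimate of how likely at least one chance collision is among `n` files. The function names are illustrative, and the estimate assumes checksum values are uniformly distributed, which is only an approximation.

```javascript
// Precompute the 256-entry lookup table for the reflected CRC-32 polynomial.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c >>> 0;
}

// Compute the CRC-32 of a Buffer (or any iterable of bytes) as an unsigned integer.
function crc32(bytes) {
  let crc = 0xFFFFFFFF;
  for (const byte of bytes) {
    crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Birthday-bound approximation: probability that n uniformly random
// checksums of the given bit width contain at least one collision.
function collisionProbability(n, bits = 32) {
  return 1 - Math.exp(-(n * (n - 1)) / (2 * Math.pow(2, bits)));
}
```

Under that approximation, `collisionProbability(35000)` comes out at a little over 13%, so a CRC match should only ever be treated as a hint that is confirmed with a byte-for-byte comparison or a stronger hash.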

jonvuri

How about the following approach:

  • When a user uploads an image, compute its MD5 sum
  • The image is then stored using that MD5 sum as a filename
  • The original image name is stored on the FS as well, but as a symlink pointing to the MD5 name.
  • If a user uploads an image that is a duplicate, then you can check whether the MD5 name already exists and just create the symlink.
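The steps above could be sketched in Node.js roughly as follows. `storeUpload` is a hypothetical helper, both the hash-named file and the symlink are assumed to live in the same directory, and clashes between two different images uploaded under the same original name are deliberately not handled here.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Store `data` under its MD5 name and link `originalName` to it.
// Returns the hash-based file name and whether the content already existed.
function storeUpload(dir, originalName, data) {
  const hashName = crypto.createHash('md5').update(data).digest('hex');
  let duplicate = false;

  try {
    // 'wx' fails with EEXIST if the file is already there, so the
    // existence check and the write happen in a single step.
    fs.writeFileSync(path.join(dir, hashName), data, { flag: 'wx' });
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    duplicate = true;
  }

  // Relative symlink: original name -> MD5 name in the same directory.
  // Throws EEXIST if `originalName` is already taken (not handled here).
  fs.symlinkSync(hashName, path.join(dir, originalName));
  return { hashName, duplicate };
}
```

Uploading the same bytes as `a.png` and then as `b.png` leaves one stored file and two symlinks pointing at it.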

For converting the existing images into that structure, I'm sure a fairly simple shell script using md5sum, mv and ln -s would do the trick.
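The same conversion could also be done from Node.js instead of the shell. The sketch below is one possible take, not a tested migration tool: `migrateDirectory` is a hypothetical name, it reads each file fully into memory, and it assumes a flat directory where no original file name happens to equal another file's MD5 sum. Try it on a copy of the data first.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Convert a flat directory of images to "MD5-named file + symlink" layout.
// Returns the number of duplicate files that were removed.
function migrateDirectory(dir) {
  let removed = 0;

  for (const name of fs.readdirSync(dir)) {
    const filePath = path.join(dir, name);

    // lstat does not follow links, so symlinks and subdirectories are skipped.
    if (!fs.lstatSync(filePath).isFile()) continue;

    const hashName = crypto
      .createHash('md5')
      .update(fs.readFileSync(filePath))
      .digest('hex');
    if (name === hashName) continue; // already in the new layout

    const hashPath = path.join(dir, hashName);
    if (fs.existsSync(hashPath)) {
      fs.unlinkSync(filePath); // content already stored: drop the duplicate
      removed++;
    } else {
      fs.renameSync(filePath, hashPath); // equivalent of `mv`
    }

    fs.symlinkSync(hashName, filePath); // equivalent of `ln -s`
  }

  return removed;
}
```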

One other possibility is to use something like MongoDB to store the images in a DB, which may well be easier to cluster.

beny23