I run a small image hosting site and have realized that it contains a lot of duplicate content. I want to eliminate this problem in the future by using a checksum or hash: each newly uploaded file is hashed and compared against a database of existing image hashes; if it already exists, the new file is deleted and the user is given the link to the existing image. All of this should happen in a single step at upload time.
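To make the flow concrete, here is a rough sketch of what I have in mind. The names `hashIndex`, `hashFile` and `registerUpload` are placeholders I made up, the in-memory `Map` only stands in for whatever database ends up being used, and the synchronous file calls are there for brevity rather than as a recommendation.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical in-memory index: hex digest -> path of the first stored copy.
// Assumption: a real setup would persist this in a database with a unique
// key on the digest instead of keeping it in process memory.
const hashIndex = new Map();

// Hash the full file content. MD5 is the default only because my duplicate
// test used it; any algorithm name accepted by crypto.createHash works here.
function hashFile(filePath, algorithm = 'md5') {
  const data = fs.readFileSync(filePath);
  return crypto.createHash(algorithm).update(data).digest('hex');
}

// Decide what to do with a freshly uploaded file.
// If the same content is already indexed under a different path, the new
// file is deleted and the existing path is returned; otherwise the new
// file is recorded in the index and kept.
function registerUpload(uploadedPath, index = hashIndex) {
  const digest = hashFile(uploadedPath);
  const existing = index.get(digest);
  if (existing !== undefined && existing !== uploadedPath) {
    fs.unlinkSync(uploadedPath);
    return { path: existing, duplicate: true };
  }
  index.set(digest, uploadedPath);
  return { path: uploadedPath, duplicate: false };
}
```

The upload handler would call `registerUpload(savedPath)` once the file has been written to disk and build the link shown to the user from the returned `path`.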
My setup is bare-bones: Node.js + jQuery File Upload + 2 directories (one for forum uploads, the other for direct web uploads).
What is the best (fast and reliable) hash and database setup for this, given that there might eventually be thousands or millions of files in each directory? I think MD5 or SHA-1 is overkill and might take a lot of resources, so I would like to know if there is a simpler solution.
Statistics:
~1,000 images uploaded every day
~400 KB average image size
~35,000 images on the server
~30% duplicate content (tested using MD5)
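For scale, here is the back-of-the-envelope arithmetic behind those figures, written out as a small snippet. The variable names are mine, the inputs are the approximate numbers above, and I am assuming decimal units (1 MB = 1,000 KB), so the results are rough estimates only.

```javascript
// Approximate inputs taken from the statistics above.
const uploadsPerDay = 1000;
const avgSizeKB = 400;
const storedImages = 35000;
const duplicateRatio = 0.30;

// Data any hash function would have to read per day, in MB.
const dailyHashedMB = (uploadsPerDay * avgSizeKB) / 1000;

// Estimated number of duplicate files currently stored.
const duplicateFiles = Math.round(storedImages * duplicateRatio);

// Estimated disk space taken by those duplicates, in GB.
const wastedGB = (duplicateFiles * avgSizeKB) / 1e6;

console.log(dailyHashedMB, duplicateFiles, wastedGB); // 400 10500 4.2
```

So roughly 400 MB of new uploads would need hashing per day, and about 10,500 of the current files (around 4.2 GB) are duplicates.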