
I'm writing a Chrome extension that saves images from websites. In addition to saving the files themselves, I'd like to turn each image into some type of hash.

The objective is to index the images in a database so that I can easily determine if an image is a duplicate (independent of size, i.e., a thumbnail and a full-size image would be considered duplicates). I'm not really worried about images with slight differences (besides size).

I've tried to work with this library, but it's large, a bit slower than I'd like, and (ostensibly) not supported anymore.

I've also tried a number of phash algorithm implementations, but as far as I can tell, they're all intended for server-side use. I'm using webpack, which was unable to bundle any of the libs I tried (very possibly user error; I'm no webpack pro).

Lastly, I tried converting the image to base64, but the results are 10k+ characters long, and it's not clear to me that this would work for images of different sizes.

Brandon

1 Answer


I would just implement a fast string hash in JavaScript. Convert the image to base64, then run a string hash on it:

https://www.npmjs.com/package/non-crypto-hash (these work in both Node and the browser; you could bring this in with Browserify)

or an algorithm you can convert: http://landman-code.blogspot.ca/2008/06/superfasthash-from-paul-hsieh.html

Assuming you don't need a cryptographically secure hash, these will probably be your speediest options.
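As a rough sketch of that idea, here is a 32-bit FNV-1a string hash. FNV-1a is swapped in purely for illustration; it is not the algorithm from either link above. The names `fnv1a32` and `hashBase64Image` are hypothetical, and the sketch assumes you already have the image as a base64 string or data URL (for example from `FileReader.readAsDataURL`).

```javascript
// 32-bit FNV-1a over a string's UTF-16 code units.
// Base64 text is pure ASCII, so each code unit is one byte here.
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping to 32 bits
  }
  // Convert to unsigned and format as a fixed-width 8-character hex string.
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical helper: strip a "data:...;base64," prefix if present, then hash the payload.
function hashBase64Image(dataUrlOrBase64) {
  const comma = dataUrlOrBase64.indexOf(",");
  const payload =
    dataUrlOrBase64.startsWith("data:") && comma !== -1
      ? dataUrlOrBase64.slice(comma + 1)
      : dataUrlOrBase64;
  return fnv1a32(payload);
}
```

Note that this only identifies byte-for-byte identical data. A thumbnail and its full-size original have different bytes, so they will hash differently.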

EdH
  • This seems feasible. I'm pretty ignorant about hashing. Is comparing hashes straightforward? That is, can they be compared such that visually identical images (apart from size) match? – Brandon Sep 13 '16 at 02:25
  • Well, the hash should provide very different results for even very similar base64 strings, or it's a broken hash. So if the hashes are the same, the base64 / images should be the same. You could get a hash collision, in which case you should deal with that. But the likelihood is quite low. Internally I can guarantee you that the JS engine is using a hash for every key in an object anyway. But using huge keys like that is probably a very bad idea - so shorten it with one of those functions. – EdH Sep 14 '16 at 23:13
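
To make the comparison and collision handling concrete, here is a minimal sketch of a duplicate index keyed by the short hash. It assumes a 32-bit FNV-1a hash for illustration, and `ImageIndex` is a hypothetical name, not part of any library. Equal hashes are treated only as a hint: the full strings are compared before something is called a duplicate.

```javascript
// 32-bit FNV-1a string hash, returned as 8 hex characters (illustrative choice of hash).
function fnv1a32(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical in-memory index: short hash -> list of base64 strings sharing that hash.
class ImageIndex {
  constructor() {
    this.buckets = new Map();
  }

  // Returns true if the image was new and got stored, false if it is an exact duplicate.
  add(base64) {
    const key = fnv1a32(base64);
    const bucket = this.buckets.get(key);
    if (bucket === undefined) {
      this.buckets.set(key, [base64]);
      return true;
    }
    // Same hash: confirm with a full comparison so a collision is not mistaken for a duplicate.
    if (bucket.includes(base64)) {
      return false;
    }
    bucket.push(base64);
    return true;
  }
}

const index = new ImageIndex();
const firstAdd = index.add("QUJD");  // new entry
const secondAdd = index.add("QUJD"); // exact duplicate
const thirdAdd = index.add("QUJE");  // differs by one character
```

In a real database you would likely index on the hash column and fetch the full data only when a hash matches. This still detects exact duplicates only; a resized copy of the same picture produces a different base64 string and therefore a different hash.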