A news portal company has two servers (both running CentOS 6):

Server #1 holds about 1 million images (.jpg, .png), and server #2 holds almost the same number. Some of them are identical duplicates, some are resized duplicates, some are blurred and some are not, and some are totally unique images. The file names are mostly different as well.

The mission is to merge the two servers' media catalogues into one. After the merge, duplicates must be removed (to free up storage).

I've run some tests with ImageMagick's `compare -metric RMSE`, but comparing every file on one server with every file on the other means 1 million x 1 million = 1 trillion operations, which would take ages.

Any suggestions here?

Sid
  • 1
    Could you check the MD5 checksums against each other? I feel like it may be faster than ImageMagick, but you're still doing the 1 trillion operations – IsThisJavascript May 01 '18 at 14:08
  • 1
    The problem with MD5 is that the first server took the original photos (uploaded by journalists) and downsized them to 1600x900 px with different compression rates, while the second server took the original photos and stored them on disk unchanged. So the MD5 will be different for all the files. :( – Sid May 01 '18 at 14:22

1 Answer

Use GNU Parallel to calculate just once, for each image:

  • a data-only checksum

  • a Perceptual Hash

Then discard all the ones with identical checksums and review the ones with similar perceptual hashes.
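One way to make "similar perceptual hashes" concrete is to count the bits in which two hashes differ (the Hamming distance) and review the pairs that fall below some threshold. The sketch below is only an illustration: it assumes the hashes are stored as equal-length hexadecimal strings, and the `hamming` function name is made up for this example. It uses plain awk, so it does not depend on any particular ImageMagick version.

```shell
#!/usr/bin/env bash
# Sketch only: assumes both arguments are valid hex strings of equal length.

# hamming HEX1 HEX2: print the number of bits that differ between two hashes.
hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "hash lengths differ" > "/dev/stderr"
            exit 1
        }
        digits = "0123456789abcdef"
        a = tolower(a); b = tolower(b)
        for (i = 1; i <= length(a); i++) {
            # Numeric value (0-15) of the i-th hex digit of each hash.
            x = index(digits, substr(a, i, 1)) - 1
            y = index(digits, substr(b, i, 1)) - 1
            # Compare the four bits of the two digits, lowest bit first.
            for (bit = 0; bit < 4; bit++) {
                if (x % 2 != y % 2) dist++
                x = int(x / 2); y = int(y / 2)
            }
        }
        print dist + 0
    }'
}
```

For example, `hamming ff00 0f00` prints 4. Which distance counts as "similar" depends on the hash in use and would need tuning against a handful of known duplicates.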


Get a checksum over the image data only (i.e. not including any meta-data like a different date in your images) using ImageMagick like this:

identify -format "%#" a.jpg
9e51c9cf53fddc7d318341cd7e6c6e34663e5c49f20ede16e29e460dfc63867
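As a rough sketch of how that checksum could drive the "discard identical" step: compute it once per file, store the results in an index, then list every file whose checksum has already been seen. The `build_index` and `list_duplicates` names, the `media/` directory and the `index.txt` file are assumptions for this illustration, and `%i` (the input file name) is added to the format string so each checksum is stored next to its path. On ImageMagick 7, `identify` becomes `magick identify`.

```shell
#!/usr/bin/env bash
# Sketch only: function names, paths and the index layout are illustrative assumptions.

# build_index DIR: print "checksum path" for every .jpg/.png under DIR.
# Each file is read exactly once; GNU Parallel spreads the work over the CPU cores.
build_index() {
    find "$1" -type f \( -iname '*.jpg' -o -iname '*.png' \) -print0 |
        parallel -0 'identify -format "%# %i\n" {}'
}

# list_duplicates INDEX: print the path of every file whose checksum already
# appeared earlier in INDEX, i.e. all but the first file of each identical group.
list_duplicates() {
    awk '{
        sum = $1
        if (seen[sum]++) {
            sub(/^[^ ]+ /, "")   # drop the checksum, keep the path (may contain spaces)
            print
        }
    }' "$1"
}

# Typical use (commented out so that sourcing this file has no side effects):
#   build_index media/ > index.txt
#   list_duplicates index.txt          # review this list before deleting anything
```

This only catches images whose pixel data is identical. The resized or recompressed copies described in the question will have different checksums and still need the perceptual-hash comparison.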

Links to Perceptual Hash generation:

Mark Setchell
  • Hey Mark, thanks! Actually, I was already looking at the first link you mentioned - https://stackoverflow.com/questions/25198558/matching-image-to-images-collection/25204466#25204466. Check the second comment there - it's mine, and that script is throwing errors for me. – Sid May 01 '18 at 14:24
  • If the script is giving errors, try running it like this so you can see what is happening... `bash -xv ./script` – Mark Setchell May 01 '18 at 15:03
  • Output: https://codeshare.io/21p9JB . As I understand it, `-gt` is not working, so I've changed it to a simple `>` sign and it seems to run without errors. But the hash is still `0000000000000000` every time, for different images. I am debugging it and will let you know. – Sid May 01 '18 at 15:25
  • @fmw42 OP is referring to my code here https://stackoverflow.com/a/25204466/2836621 – Mark Setchell May 01 '18 at 17:38
  • 1
    Are you running the code with `bash`? Have you installed **ImageMagick**? Which IM version are you using? If you are using v7, you'll need to change all instances of `convert` to `magick`, as the program name changed. Likewise, all instances of `identify` need to be changed to `magick identify` at v7, if you are using that. You are not on Windows, are you? – Mark Setchell May 01 '18 at 17:40
  • @MarkSetchell yep, `Version: ImageMagick 7.0.7-28`. I've changed all `convert` to `magick`, and I didn't find any `identify` instances in your code. The result is the following: `$ ./byb.sh 1/bird1.jpg 0`. :( I am on CentOS 6 – Sid May 02 '18 at 05:10
  • Please run the script using `bash -xv theScript someImage.jpg` – Mark Setchell May 02 '18 at 06:12
  • 1
    I have just added a (bash unix ImageMagick) script to compute 4 different perceptual hashes that produce binary string hashes that can be stored in the image meta data. Also a script for computing the hamming distance between two binary string hashes. It can access the hashes from the meta data in the image. My scripts are at http://www.fmwconcepts.com/imagemagick/index.php – fmw42 May 03 '18 at 17:55