I have 2 directories with lots and lots of images, say: color/ and gray/. In color/ images are named: image1.png image2.png, etc.

I know that gray/ contains the same images, but in grayscale, and that the file names and the order of the files are different (e.g. file_01.png exists there, but it IS NOT the same image as image1.png).

Is it possible to make a comparison of images in both directories and copy color/ files to a results/ directory with gray/ file names?

Example:

directory        | directory           | directory
   "color/"      |     "gray/"         |      "results/" 
(color images)   | (grayscale images)  | (color images with gray-scale names)   
-----------------+---------------------+----------------------------------------
color/image1.png | gray/file324.png    | results/file324.png  (in color: ==>
                                       | this and image1.png are the same image)

I hope this is not very confusing, but I don't know how to explain it better.

I have tried with imagemagick, and it seems that the -compare option could work for this, but I'm unable to make a bash script or something that does it well.

Another way to say it: I want all color/*.jpg copied into the results/*.jpg folder using the correctly matching gray/*.jpg names.

EDIT (some notes):

  1. The three images are IDENTICAL in size and content. The only difference is that two are in color and one is in grayscale. And the names of the files, of course.
  2. I uploaded a zip file with one sample image under its current names (folder "img1" is the color folder and folder "img2" is the grayscale folder) and the expected result ("img3" is the results folder), here: http://www.mediafire.com/?9ug944v6h7t3ya8

Kurt Pfeifle
El Andi
  • We could suggest a lot of different algorithms... but without access to some representative sample images, we can't be sure which one has a chance of working for your specific case. Can you provide (links to) a set of 3 such similar images, 1 color, 1 grayscale, 1 'color with gray name'? – Kurt Pfeifle Sep 29 '12 at 08:00
  • Are the *img1/image1.png* and *img3/file324.png* images only **similar**, or are they **identical**? – Kurt Pfeifle Sep 29 '12 at 08:02
  • See the edit above. The three images are identical. img1/image1.png and img3/file324.png are the same file under different names. The image with the right name is in the img2/ folder, but that image is in grayscale. I need it in color. And I have thousands of images to process with that problem. I really want to have a software solution for this. – El Andi Oct 03 '12 at 22:10
  • Ok, I've looked at your samples now. I've thought about an algorithm using a few simple commands which should work with standard ImageMagick (no need for using the rather complex perceptual hash technique). I don't have the time to write it down just now, but maybe tonight or tomorrow... Stay tuned. – Kurt Pfeifle Oct 04 '12 at 10:51
  • How many 'thousands' of images exactly do you need to compare? (Assuming you have 3000 different color images to be compared, this leads to 4.5 million comparisons. Let each comparison take only 5 seconds, and we end up spending 22.5 million seconds, which is about 260 days of run time...) -- Therefore some basic, 'a-priori' performance considerations should be made when designing an algorithm... – Kurt Pfeifle Oct 04 '12 at 19:44
  • Thanks to you all! :D I will try all your suggestions and be back when I finish. Answering to Kurt: Right now, I'm making this with a sample of just around 600 - 900 images. However, I will use the best technique I can create probably over and over again in several batches of pictures. I think all your ideas will be very useful. Thank you so very much. – El Andi Oct 05 '12 at 05:34
  • And also thanks for all your corrections to my question, Kurt, I'm sorry for my poor English. :) – El Andi Oct 05 '12 at 07:03
  • Hehe... the renaming of the directory names I **had** to do -- the original ones kept confusing me when I was thinking about a solution. :-) – Kurt Pfeifle Oct 05 '12 at 10:42

3 Answers

4

If I understood the requirement correctly, we need to:

  • find for each grayscale image named XYZ that is in folder gray/...
  • ...the matching color image named ABC that is in folder color/ and...
  • ...copy ABC to folder results/ under the new name XYZ

So the basic algorithm I suggest is this:

  1. Convert all images in folder color/ to grayscale and store result in folder gray-reference/. Keep the original names:

    mkdir gray-reference
    convert  color/img123.jpg  -colorspace gray  gray-reference/img123.jpg
    
  2. Compare each grayscale image in gray-reference/ with each grayscale image in folder gray/. If you find a match, copy the image of the same name as the reference from color/ to results/, giving it the name of the matching image from gray/. One possible comparison command, which creates a visual representation of the differences, is this:

    compare  gray-reference/img123.jpg  gray/imgABC.jpg  -compose src delta.jpg
    

The real trick is the comparison (as in step 2) of the two grayscale images. ImageMagick has a handy command to compare two (similar) images pixel by pixel and write the results into a 'delta' image:

compare  reference.png  test.png  -compose src  delta.png

If the comparison is for color images, in the delta image...

  • ...each pixel that was equal appears in white, while...
  • ...each pixel that was different appears in a highlight color (defaults to red).

See also my answer "ImageMagick: 'Diff' an Image" for an illustrated example of this technique.

If we directly compared a gray image with a color image pixel by pixel we would of course find that almost every single pixel is different (resulting in an all-red "delta" picture). Hence my proposal from step 1 above to first convert the color image to grayscale.

If we compare two grayscale images, the resulting delta image is in grayscale too. Hence the default highlight color can't be red. We'd better set it to 'black' so that the differing pixels are easier to see.

Now, our grayscale conversion of the color image could produce a 'different' sort of grayscale than the one the existing gray images have (our freshly produced grays could be slightly lighter or darker than the existing grayscale image because different color profiles were applied). In that case it could still happen that our delta picture is all-"red", or rather all-highlight-color. However, I tested this with your sample images, and the results are good:

 convert  color/image1.jpg  -colorspace gray  image1-gray.jpg  
 compare                  \
    gray/file324.jpg      \
    image1-gray.jpg       \
   -highlight-color black \
   -compose src           \
    delta.jpg

delta.jpg consists of 98% white pixels. I'm not sure if all the others of your thousands of grayscale images used the same settings when they were derived from the color originals. Therefore we add a small fuzz factor when running the compare command, which allows for some deviation in color when 2 pixels are compared:

compare  -fuzz 3%  reference.png  test.png  -compose src  delta.png

Since this algorithm is to be executed many thousands of times (maybe several millions of times, given the number of images you talk about), we should make some performance considerations and we should time the duration of the compare command. This is especially a concern, since your sample images are rather large (3072x2048 pixels -- 6 Mega-Pixels), and the comparison could take a while.

My timing results on a MacBook Pro were these:

time (convert  color/image1.jpg  -colorspace gray  image1-gray.jpg ;
      compare                   \
         gray/file324.jpg       \
         image1-gray.jpg        \
        -highlight-color black  \
        -fuzz 3%                \
        -compose src            \
         delta100-fuzz.jpg)

  real  0m6.085s
  user  0m2.616s
  sys   0m0.598s

6 seconds for: 1 conversion of a large color image to grayscale, plus 1 comparison of two large grayscale images.

You talked about 'thousands of images'. Assuming 3000 images, based on this timing, the processing of all the images would require (3000*3000)/2 comparisons (4.5 million) and (3000*3000*6)/2 seconds (27 million sec). That's a total of 312 days to complete all comparisons. Too long, if you ask me.
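The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name estimate_days is made up for illustration, and it assumes the same N*N/2 comparison count and a constant time per comparison as in the estimate above.

```shell
# estimate_days N SECONDS_PER_COMPARISON
#   N                       number of images in each folder (assumed equal)
#   SECONDS_PER_COMPARISON  measured wall-clock time for one comparison
# Prints the estimated total run time in days, using N*N/2 comparisons.
estimate_days() {
    awk -v n="$1" -v s="$2" \
        'BEGIN { printf "%.1f\n", (n * n / 2) * s / 86400 }'
}
```

For example, `estimate_days 3000 6` prints 312.5, which matches the figure above.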

What could we do to improve the performance?

Well, my first idea is to reduce the size of the images. If we compare smaller images instead of 3072x2048 sized ones, the comparison should return its result faster. (However, we will also spend additional time on first scaling down our test images -- but hopefully much less time than we later save when comparing the smaller images.)

time (convert color/image1.jpg  -colorspace gray  -scale 6.25%  image1-gray.jpg  ;
      convert gray/file324.jpg                    -scale 6.25%  file324-gray.jpg ;
      compare                  \
         file324-gray.jpg      \
         image1-gray.jpg       \
        -highlight-color black \
        -fuzz 3%               \
        -compose src           \
         delta6.25-fuzz.jpg)

   real  0m0.670s
   user  0m0.584s
   sys   0m0.074s

That's much better! We shaved off almost 90% of processing time, which gives hope to complete the job in 35 days if you use a MacBook Pro.

The improvement is only logical: by reducing the image dimension to 6.25% of the original the resulting images are only 192x128 pixels -- a reduction from 6 million pixels to 24.5 thousand pixels, a ratio of 256:1.
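The numbers in the previous paragraph can be checked with plain shell arithmetic. A minimal sketch, assuming the scale factor is expressed as an integer divisor (6.25% corresponds to dividing each side by 16); the helper name scaled_size is hypothetical.

```shell
# scaled_size WIDTH HEIGHT DIVISOR
# Prints "<width>x<height> <pixel count>" of the image after dividing
# both sides by DIVISOR (integer division).
scaled_size() {
    local w=$1 h=$2 div=$3
    echo "$(( w / div ))x$(( h / div )) $(( (w / div) * (h / div) ))"
}
```

`scaled_size 3072 2048 16` prints `192x128 24576`, and 3072*2048 divided by 24576 gives the 256:1 ratio.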

(NOTE: The -thumbnail and the -resize parameters would work a little bit faster than -scale does. However, this speed increase is a trade-off against quality loss. That quality loss would probably make the comparison much less reliable...)

Instead of creating a visually inspectable delta image from the compared images, we can tell ImageMagick to print out some statistics. To get the number of different pixels, we can use the AE metric. The command with its results is this:

time (convert color/image1.jpg -colorspace gray -scale 6.25% image1-gray.jpg  ;
     convert gray/file324.jpg                   -scale 6.25% file324-gray.jpg ;
     compare -metric AE  file324-gray.jpg image1-gray.jpg -fuzz 3% null: 2>&1 )
0 

  real  0m0.640s
  user  0m0.574s
  sys   0m0.073s

This means we have 0 differing pixels -- a result that we could directly use inside a shell script!
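One way to use that result in a script is to wrap the call in a small function that succeeds only when the count is 0. This is a sketch, not a tested recipe: the name images_match and the default fuzz of 3% are my own choices, and it assumes that compare writes the AE count to stderr, possibly followed by further tokens in some ImageMagick versions. The -fuzz setting is placed before the file names here.

```shell
# images_match FILE_A FILE_B [FUZZ]
# Succeeds (exit status 0) when compare reports 0 differing pixels.
# FUZZ defaults to 3%.
images_match() {
    local fuzz=${3:-3%}
    local out
    # compare prints the metric on stderr; it also returns a non-zero
    # status for differing images, which we ignore here.
    out=$(compare -metric AE -fuzz "$fuzz" "$1" "$2" null: 2>&1) || true
    # Keep only the first space-separated token of the output.
    [ "${out%% *}" = "0" ]
}
```

It could then be called as `if images_match file324-gray.jpg image1-gray.jpg; then ...; fi`.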

Building blocks for a Shell script

So here are the building blocks for a shell script to do the automatic comparison:

  1. Convert color images from 'color/' directory to grayscale ones, scale them down to 6.25% and save results in 'reference-color/' directory:

    # Estimated time required to convert 1000 images of size 3072x2048:
    #   500 seconds
    mkdir reference-color
    for i in color/*.jpg; do
        convert  "${i}"  -colorspace gray  -scale 6.25%  "reference-color/$(basename "${i}")"
    done
    
  2. Scale down images from 'gray/' directory and save results in 'reference-gray/' directory:

    # Estimated time required to convert 1000 images of size 3072x2048:
    #    250 seconds
    mkdir reference-gray
    for i in gray/*.jpg; do
        convert  "${i}"  -scale 6.25%  "reference-gray/$(basename "${i}")"
    done
    
  3. Compare each image from directory 'reference-gray/' with images from directory 'reference-color' until a match is found:

    # Estimated time required to compare 1 image with 1000 images:
    #    300 seconds
    # If we have 1000 images, we need to conduct a total of 1000*1000/2
    # comparisons to find all matches;
    #    that is, we need about 2 days to accomplish all.
    # If we have 3000 images, we need a total of 3000*3000/2 comparisons
    # to find all matches;
    #    this requires about 20 days.
    #
    # the copies below need an existing target directory
    mkdir -p results

    for i in reference-gray/*.jpg ; do

        # the inner loop needs its own variable (j), otherwise it overwrites i
        for j in reference-color/*.jpg ; do

            # compare the two grayscale reference images
            if [ "x0" = "x$(compare  -metric AE  "${i}"  "${j}" -fuzz 3%  null: 2>&1)" ]; then

                # if we found a match, then create the copy under the required name
                cp "color/$(basename "${j}")"  "results/$(basename "${i}")"

                # if we found a match, then remove both reference images (we do not want to compare again with these)
                rm -f "${i}" "${j}"

                # if we found a match, break from within this loop and start the next one
                break

            fi

        done

    done
    

Caveat: Do not blindly rely on these building blocks. They are untested. I do not have a directory of multiple suitable images available to test this, and I do not want to create one myself just for this exercise. Proceed with caution!
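Because the building blocks are untested, it may help to check afterwards which grayscale names never received a color copy. The following is a sketch under the assumption that results/ is filled with files named exactly like those in gray/; the function name list_unmatched is hypothetical.

```shell
# list_unmatched GRAY_DIR RESULTS_DIR
# Prints the name of every regular file in GRAY_DIR that has no
# file of the same name in RESULTS_DIR.
list_unmatched() {
    local f name
    for f in "$1"/*; do
        # skip anything that is not a regular file (also covers an empty dir)
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ -e "$2/$name" ] || echo "$name"
    done
}
```

Running `list_unmatched gray results` after the comparison loop would list the images that still need attention, for example because the fuzz value was too strict.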

Kurt Pfeifle
  • Thank you so very very much! :D I wasn't expecting all this! Thank you again and again, sir! :D I will try your suggestions and be in touch with the results. – El Andi Oct 05 '12 at 05:29
  • This will totally work, it seems. The time consumed is not very high for now. This will be used over and over again against sets of very large pictures, yes, but the sets will be around 1000 to 1500 images, not 3000, so I think this could be done in maybe 20 days or so. Less than a month is perfect for me now. Thank you so much again! – El Andi Oct 05 '12 at 05:48
2

You should check whether a perceptual hash technique such as pHash gives good results on your actual data.

A perceptual hash will give you a reliable similarity measure since the underlying algorithms are robust enough to take into account changes/transformations such as contrast adjustment or different compression/formats - which is not the case with standard cryptographic hash functions such as MD5.

In addition, you can check whether pHash works for your own images by using its convenient web-based demo interface.
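Perceptual hashes are usually compared by counting the bits in which two hashes differ (the Hamming distance); a small distance means similar images. The sketch below shows only that comparison step and is not pHash's own code: it assumes the two hashes are already available as equal-length strings of 0 and 1, and the name hamming_distance is made up for illustration.

```shell
# hamming_distance BITS_A BITS_B
# Prints the number of positions at which the two equal-length
# bit strings differ. Returns status 2 if the lengths differ.
hamming_distance() {
    local a=$1 b=$2 d=0
    [ "${#a}" -eq "${#b}" ] || return 2
    while [ -n "$a" ]; do
        # ${a%"${a#?}"} is the first character of a
        [ "${a%"${a#?}"}" = "${b%"${b#?}"}" ] || d=$(( d + 1 ))
        # drop the first character of both strings
        a=${a#?}
        b=${b#?}
    done
    echo "$d"
}
```

A threshold on this distance would then play the same role as the -fuzz setting in the ImageMagick approach.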

deltheil
  • This is a very interesting suggestion, and I'm going to check the docs. The demo seems promising, and it outputs the result I'm expecting: that the two images, regardless of their colorspace, are identical. However, I'm not a developer, and I'm not sure yet how to implement this solution. I'm going to try it anyway, thanks. – El Andi Oct 03 '12 at 22:17
1

Kurt's solution works very well after some tweaking of the -fuzz option. :) The value for -fuzz that finally worked is 50%. I tried 3, 10, 19, 20, 24, 25, 30 and 40% with no success, probably because the gray images were generated earlier with a different method, so the grays are different. Also, the images are all of different sizes, some of them relatively small, so scaling by percentage produces bad results. I used -resize 200x instead, so that all the reference images were more or less the same size. This is the bash script I finally used:

    # this bash assumes the existence of two dirs: color/ and gray/ 
    # each one with images to compare

    echo Starting...
    echo Checking directories...
    if [ ! -d color ]; then
        echo Error: the directory color does not exist!
        exit 1;
    fi
    if [ ! -d gray ]; then
        echo Error: the directory gray does not exist!
        exit 1;
    fi

    echo Directories exist. Proceeding...

    mkdir reference-color
    echo creating reference-color...
    for i in color/*.png; do
        convert  "${i}"  -colorspace gray  -resize 200x  reference-color/$(basename "${i}")
    done
    echo reference-color created...

    mkdir reference-gray
    echo creating reference-gray...
    for i in gray/*.png; do
        convert  "${i}"  -resize 200x  reference-gray/$(basename "${i}")
    done
    echo reference-gray created...

    mkdir results
    echo created results directory...

    echo ...ready.

    echo "-------------------------"
    echo "|  starting comparison  |"
    echo "-------------------------"

    for i in reference-gray/*.png; do
        echo comparing image $i 

        for j in reference-color/*.png; do

            # compare the two grayscale reference images

            if [ "x0" == "x$(compare  -metric AE "${i}"  "${j}" -fuzz 50% null: 2>&1)" ]; then

                # if we found a match, then create the copy under the required name
                echo Found a match. Copying and renaming it...
                cp "color/$(basename "${j}")"  "results/$(basename "${i}")"

                # if we found a match, then remove the respective reference image (we do not want to compare again with this one)
                echo Deleting references...
                rm -rf "${i}"
                rm -rf "${j}"
                echo "--------------------------------------------------------------"

                # if we found a match, break from within this loop and start the next one
                break ;

            fi

        done

    done
    echo Cleaning...
    rm -rf reference-color
    rm -rf reference-gray
    echo Finished!

The measured time was as follows (for 180 images, using ImageMagick in Cygwin; it is probably faster with a native Linux ImageMagick, but I don't know yet):

real    5m29.308s
user    2m25.481s
sys     3m1.573s

I uploaded a file with the script and the set of test images, in case anyone is interested: http://www.mediafire.com/?1ez0gs6bw3rqbe4 (it is compressed in 7z format).

Thanks again!

El Andi