Image comparison performance in Java

Question

I have the code below, but it is not efficient at all: it is very slow, and the more pictures I have to compare, the longer it takes.

For example, I have 500 pictures and each one takes 2 minutes to process: 500 x 2 min = 1000 min!

The specific requirement is that as soon as a picture is found to be identical to the one being compared, it is moved to another folder. The remaining files are then listed again for the next iteration.

Any ideas?

public static void main(String[] args) throws IOException {

    String PicturesFolderPath=null;
    String removedFolderPath=null;
    String pictureExtension=null;
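    // Each argument is optional; read it only if it was actually supplied.
    if(args.length>0) {
         PicturesFolderPath=args[0];
    }
    if(args.length>1) {
         removedFolderPath=args[1];
    }
    if(args.length>2) {
         pictureExtension=args[2];
    }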


    if(StringUtils.isBlank(pictureExtension)) {
        pictureExtension="jpg";
    }

    if(StringUtils.isBlank(removedFolderPath)) {
        removedFolderPath=Paths.get(".").toAbsolutePath().normalize().toString()+"/removed";
    }

    if(StringUtils.isBlank(PicturesFolderPath)) {
        PicturesFolderPath=Paths.get(".").toAbsolutePath().normalize().toString();
    }

    System.out.println("path to find pictures folder "+PicturesFolderPath);
    System.out.println("path to find removed pictures folder "+removedFolderPath);

    Collection<File> fileList = FileUtils.listFiles(new File(PicturesFolderPath), new String[] { pictureExtension }, false);

    System.out.println("there is "+fileList.size()+" files founded with extention "+pictureExtension);

    Iterator<File> fileIterator=fileList.iterator();
    //Iterator<File> loopFileIterator=fileList.iterator();

    File dest=new File(removedFolderPath);

    while(fileIterator.hasNext()) {
        File file=fileIterator.next();

        System.out.println("process image :"+file.getName());

        //each new iteration we retrieve the files staying
        Collection<File> list = FileUtils.listFiles(new File(PicturesFolderPath), new String[] { pictureExtension }, false);
        for(File f:list) {
            if(compareImage(file,f) && !file.getName().equals(f.getName()) ) {
                String filename=file.getName();
                System.out.println("file :"+file.getName() +" equal to "+f.getName()+" and will be moved on removed folder");
                File existFile=new File(removedFolderPath+"/"+file.getName());
                    if(existFile.exists()) {
                        existFile.delete();
                    }
                    FileUtils.moveFileToDirectory(file, dest, false);
                    fileIterator.remove();
                    System.out.println("file :"+filename+" removed");
                    break;

                }           
        }

    }

}


// Compares two image files.
// Returns true if both images have identical pixel data, false otherwise.
public static boolean compareImage(File fileA, File fileB) {        
    try {
        // take the buffer data from both image files
        BufferedImage biA = ImageIO.read(fileA);
        DataBuffer dbA = biA.getData().getDataBuffer();
        int sizeA = dbA.getSize();                      
        BufferedImage biB = ImageIO.read(fileB);
        DataBuffer dbB = biB.getData().getDataBuffer();
        int sizeB = dbB.getSize();
        // compare data-buffer objects //
        if(sizeA == sizeB) {
            for(int i=0; i<sizeA; i++) { 
                if(dbA.getElem(i) != dbB.getElem(i)) {
                    return false;
                }
            }
            return true;
        }
        else {
            return false;
        }
    } 
    catch (Exception e) { 
        e.printStackTrace();
        return  false;
    }
}

See alternate ways of doing it here : https://stackoverflow.com/questions/11006394/is-there-a-simple-way-to-compare-bufferedimage-instances — Arnaud, Jul 24 '18 at 10:02
I think just comparing MD5 is not enough: the files do not have the same name, and I think MD5 uses the file name, no? Is it as efficient as ImageIO? Thank you all — cyril, Jul 24 '18 at 10:08
And looping over each file, then over every pixel, could take very long. I will try it, but I expect it to be slower than my code — cyril, Jul 24 '18 at 10:10
Note that MD5 doesn't care about the file name, only about the content. — Arnaud, Jul 24 '18 at 10:12
I think MD5 is not the solution. I tried it with copied images and it works, but if the pictures have different metadata, such as the picture date and time, it does not work. I just tried it and it found no duplicates, although there are some! It did find all the manually copied files... So yes, it is faster, but it does not work — cyril, Jul 24 '18 at 11:09
@Arnaud, this link is not the fix. First, the code is the same as mine, and the "fast" example does not work: it fails with a class cast error... — cyril, Jul 24 '18 at 11:26
I don't know how often the file names match, but putting that check first will be a lot faster when the files really are the same — mike, Jul 24 '18 at 19:10
The name is definitely not an option: no two files have the same name, and the same name does not mean the same file. They come from a bulk download, so each file has its own name. Thank you! — cyril, Jul 25 '18 at 05:52

maaartinus · Answer 1 · 2018-07-26T22:27:02.590

The already mentioned answer should help you a bit, since checking the width and height of the pictures first should quickly exclude many candidate pairs.
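
A minimal sketch of such a cheap pre-check, assuming only the standard javax.imageio API. The class and method names (ImageDimensions, readSize, sameSize) are made up for illustration. It asks an ImageReader for the width and height, which for common formats typically comes from the file header, so the pixels do not have to be decoded:

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageDimensions {

    // Hypothetical helper: returns {width, height} of the first image in the file.
    static int[] readSize(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open: " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No image reader for: " + file);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                // Release the reader's native resources.
                reader.dispose();
            }
        }
    }

    // Hypothetical helper: pictures of different size cannot be pixel-identical.
    static boolean sameSize(File a, File b) throws IOException {
        return Arrays.equals(readSize(a), readSize(b));
    }
}
```

Calling sameSize(file, f) before compareImage(file, f) would skip the expensive full decode for any pair whose dimensions differ.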

However, you still have a big problem: for every new file, you read all old files. The number of comparisons grows quadratically, and with an ImageIO.read call at every step, it simply must be slow.

You need some fingerprints, which can be compared very fast. You can't use fingerprinting over the whole file content as it's infested by the metadata, but you can fingerprint the image data alone.

Just iterate over the image data of a file (like you do) and compute e.g. an MD5 hash of it. Store it, e.g. as a String, in a HashSet and you'll get a very fast lookup.

Some untested code

For every image file you want to compare, you compute (using Guava's hashing)

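// Requires Guava (com.google.common.hash.HashCode, Hasher, Hashing).
HashCode imageFingerprint(File file) throws IOException {
    Hasher hasher = Hashing.md5().newHasher();
    BufferedImage image = ImageIO.read(file);
    // Hash only the decoded pixel data, so file metadata is ignored.
    DataBuffer buffer = image.getData().getDataBuffer();
    int size = buffer.getSize();
    for(int i=0; i<size; i++) {
        hasher.putInt(buffer.getElem(i));
    }
    return hasher.hash();
}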

The computation works with the image data only, just like compareImage in the question, so the metadata get ignored.

Instead of searching for a duplicate in a directory, you compute the fingerprints of all its files and store them in a HashSet<HashCode>. For a new file, you compute its fingerprint and look it up in the set.
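
The lookup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not tested against real photo collections: it swaps Guava for the JDK's java.security.MessageDigest so that it has no dependencies, the names ImageDeduplicator and findDuplicates are made up, and it additionally feeds the width and height into the hash. Like compareImage in the question, it only reads the first bank of the DataBuffer.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBuffer;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.imageio.ImageIO;

public class ImageDeduplicator {

    // MD5 over the decoded pixel data only, so file metadata is ignored.
    static String imageFingerprint(File file) throws IOException, NoSuchAlgorithmException {
        BufferedImage image = ImageIO.read(file);
        if (image == null) {
            throw new IOException("Not a readable image: " + file);
        }
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        ByteBuffer scratch = ByteBuffer.allocate(4);
        // Assumption: dimensions are hashed too, so that images with the
        // same raw data but a different shape get different fingerprints.
        md5.update(scratch.putInt(0, image.getWidth()).array());
        md5.update(scratch.putInt(0, image.getHeight()).array());
        DataBuffer buffer = image.getData().getDataBuffer();
        int size = buffer.getSize();
        for (int i = 0; i < size; i++) {
            md5.update(scratch.putInt(0, buffer.getElem(i)).array());
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }

    // Returns the files whose fingerprint matches an earlier file in the list.
    // Every file is decoded exactly once.
    static List<File> findDuplicates(List<File> files) throws IOException, NoSuchAlgorithmException {
        Set<String> seen = new HashSet<>();
        List<File> duplicates = new ArrayList<>();
        for (File file : files) {
            // Set.add returns false if the fingerprint was already present.
            if (!seen.add(imageFingerprint(file))) {
                duplicates.add(file);
            }
        }
        return duplicates;
    }
}
```

The files returned by findDuplicates are the ones that would be moved to the "removed" folder. With 500 pictures this means 500 calls to ImageIO.read in total, instead of two calls for every compared pair.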

Thank you. As I said before, MD5 is not acceptable because the pictures come from the internet, the same picture can have different metadata, and MD5 is based on the file metadata. The mentioned answer is the same as my code but not with BufferedImage; maybe it can be faster... Maybe I can remove the checked file from the iterator: indeed, the further we go, the fewer files we have to check (files already checked). Thank you for your time! — cyril, Jul 25 '18 at 05:50
@cyril MD5 is not based on any metadata. MD5 computes a hash *of what you feed in*. As I wrote, you should feed MD5 the same data you're using for the comparison. No metadata included. — maaartinus, Jul 25 '18 at 17:48
