I have a folder containing identical photos under different names, and I want to remove the duplicates (it doesn't matter which ones) using the Stream API, which I'm fairly new to.

I tried to use this method, but of course it's not that simple, and it's just deleting all files.

File directory = new File("D:\\Photos\\Test");

List<File> files = Arrays.asList(Objects.requireNonNull(directory.listFiles()));
files.stream().distinct().forEach(file -> Files.delete(file.toPath()));

I also tried to convert each file into a byte array and apply distinct() on the stream of byte arrays, but it didn't find any duplicates.

Is there a way to make this happen using only streams?

Alexander Ivanchenko
VegetaSan
  • You *say* you don't care which files get deleted but it would be better to develop reusable code for when you *do*. I would tend to create a ```Map>``` where the key is the checksum (it could be ```String```). Once collected, you could go through it at leisure and delete based on pattern matching. – g00se Aug 27 '22 at 05:51

3 Answers

but of course it's not that simple and it's just deleting all files

Indeed: distinct() on a stream of File objects keeps the files that have distinct paths, because File.equals() compares paths and doesn't look at the content. Since every file in the folder has a distinct path, nothing is filtered out and all of the files get deleted.
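To make this concrete, here is a small sketch of both attempts from the question. The class name, the temporary directory and the file names are made up for illustration. It also shows why distinct() on byte arrays finds nothing: Java arrays inherit the identity-based equals() from Object, so two arrays with the same content are still considered different.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class DistinctOnFilesDemo {

    // Counts the File objects that survive distinct(); File.equals() compares paths only.
    static long distinctFileCount(List<File> files) {
        return files.stream().distinct().count();
    }

    // Counts the byte arrays that survive distinct(); arrays use identity-based equals().
    static long distinctArrayCount(List<byte[]> contents) {
        return contents.stream().distinct().count();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("distinct-demo");
        Path a = Files.write(dir.resolve("a.jpg"), new byte[] {1, 2, 3});
        Path b = Files.write(dir.resolve("b.jpg"), new byte[] {1, 2, 3}); // same content, different name

        // Both calls print 2: neither approach recognises the second file as a duplicate.
        System.out.println(distinctFileCount(List.of(a.toFile(), b.toFile())));
        System.out.println(distinctArrayCount(List.of(Files.readAllBytes(a), Files.readAllBytes(b))));
    }
}
```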

What you really need is logic for determining whether the contents of two files are the same. Since Java 12 there is the method Files.mismatch(), which compares the bytes of the two specified files and returns the position of the first mismatching byte, or -1 if the files are identical.
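A minimal sketch of how Files.mismatch() behaves, assuming Java 12 or later. The class name, the helper sameContent() and the file names are only for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MismatchDemo {

    // True when both files hold exactly the same bytes.
    static boolean sameContent(Path first, Path second) throws IOException {
        return Files.mismatch(first, second) == -1L;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mismatch-demo");
        Path a = Files.write(dir.resolve("a.bin"), new byte[] {10, 20, 30});
        Path b = Files.write(dir.resolve("b.bin"), new byte[] {10, 20, 30});
        Path c = Files.write(dir.resolve("c.bin"), new byte[] {10, 99, 30});

        System.out.println(Files.mismatch(a, b)); // -1: identical content
        System.out.println(Files.mismatch(a, c)); // 1: the first differing byte is at index 1
        System.out.println(sameContent(a, b));    // true
    }
}
```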

Another important thing to note is that the Stream API isn't the right tool in this case, because of the need to deal with checked exceptions. Both mismatch() and delete() throw IOException (which is common for methods of the Files class), and a checked exception can't be propagated out of a lambda passed to a stream operation. Exception-handling logic inside the lambda looks ugly and hurts readability. You could extract the code that invokes mismatch() and delete() into two separate methods, but that would duplicate the exception-handling logic.
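For comparison, this is roughly what a stream-based version ends up looking like. It is only a sketch: the class and the helper findCopies() are hypothetical names, and wrapping the IOException in UncheckedIOException is one common workaround, not the only one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamWithCheckedExceptions {

    // Hypothetical helper: returns the regular files in 'folder' whose content equals 'original'.
    static List<Path> findCopies(Path folder, Path original) throws IOException {
        try (Stream<Path> paths = Files.list(folder)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(path -> !path.equals(original))
                .filter(path -> {
                    try {
                        return Files.mismatch(path, original) == -1;
                    } catch (IOException e) {
                        // the lambda passed to filter() is not allowed to throw a checked exception
                        throw new UncheckedIOException(e);
                    }
                })
                .collect(Collectors.toList());
        }
    }
}
```

Deleting the returned paths would need a second try/catch of the same shape inside forEach(), which is the duplication mentioned above.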

A better option is to use DirectoryStream as the means of traversal and handle exceptions right on the spot:

public static void removeDuplicates(Path targetFolder, Path originalFile) {
    
    try(DirectoryStream<Path> paths = Files.newDirectoryStream(targetFolder)) {
        
        for (Path path: paths) {
            if (Files.isRegularFile(path)                       // skip subdirectories
                && !originalFile.equals(path)                   // skip the original itself
                && Files.mismatch(path, originalFile) == -1) {  // -1 means the contents are identical

                Files.delete(path);
            }
        }
            
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Sidenote: the File class is legacy, so avoid using it. Stick with Path and Files instead.

If there is no particular original file and you need to clean the whole folder of duplicates, you could call the method shown above for every file in the folder. But that would read the same files multiple times, which is undesirable.

To avoid reading files multiple times, we can calculate a hash of every encountered file and offer every hash to a Set. If the hash gets rejected, that means that the file is a duplicate.

In the code below, SHA-256 is used as a hashing algorithm.

public static void removeDuplicates(Path targetFolder) {
    try (DirectoryStream<Path> paths = Files.newDirectoryStream(targetFolder)) {
        
        Set<String> seen = new HashSet<>();
        
        for (Path path : paths) {
            if (Files.isDirectory(path)) continue;
            
            if (!seen.add(getHash(path))) { // the hash has been encountered previously, hence the file is a duplicate
                
                Files.delete(path);
            }
        }
        
    } catch (IOException | NoSuchAlgorithmException e) {
        e.printStackTrace();
    }
}

public static String getHash(Path path) throws NoSuchAlgorithmException, IOException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    md.update(Files.readAllBytes(path));
    return toHexadecimal(md.digest());
}

public static String toHexadecimal(byte[] bytes) {
    
    return IntStream.range(0, bytes.length)
        .mapToObj(i -> String.format("%02x", bytes[i]))
        .collect(Collectors.joining());
}

Note that although it's possible for two different files to produce the same hash, it's extremely unlikely, and the code shown above ignores the possibility of collisions.

If you wonder what code that handles collisions might look like, here is an extended version.
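One possible way to handle collisions is sketched below. This is my own illustration and not necessarily the same as the extended version referred to above. The idea is to treat an equal hash only as a hint and confirm it with Files.mismatch() before deleting anything. The class and method names are hypothetical, and Base64 is used instead of hexadecimal simply to keep the example short.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CollisionSafeDeduplicator {

    // Deletes every regular file in 'folder' whose bytes equal those of a file seen earlier.
    // Returns the paths that were deleted.
    static List<Path> removeDuplicates(Path folder) throws IOException, NoSuchAlgorithmException {
        Map<String, List<Path>> keptByHash = new HashMap<>(); // hash -> files kept so far
        List<Path> deleted = new ArrayList<>();

        try (DirectoryStream<Path> paths = Files.newDirectoryStream(folder)) {
            for (Path path : paths) {
                if (!Files.isRegularFile(path)) continue;

                List<Path> candidates = keptByHash.computeIfAbsent(hash(path), k -> new ArrayList<>());
                if (isCopyOfAny(path, candidates)) {
                    Files.delete(path);
                    deleted.add(path);
                } else {
                    candidates.add(path); // first file with this content (or a genuine collision)
                }
            }
        }
        return deleted;
    }

    // Byte-by-byte confirmation, so that a hash collision alone never causes a deletion.
    static boolean isCopyOfAny(Path path, List<Path> candidates) throws IOException {
        for (Path candidate : candidates) {
            if (Files.mismatch(path, candidate) == -1) return true;
        }
        return false;
    }

    static String hash(Path path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return Base64.getEncoder().encodeToString(md.digest(Files.readAllBytes(path)));
    }
}
```

The extra comparison only happens for files whose hashes are already equal, so in the common case each file is read once for hashing and duplicates are read one more time for confirmation.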

Alexander Ivanchenko
  • Wow, beautiful solution! Right now I'm trying to rewrite your method to take only one parameter, targetFolder, so that I could just pass in the path of the target folder and the program would delete all duplicates (even if there is more than one) and leave only the original files. So far there's no progress: I tried using nested fori loops, but that works only in specific cases, and I'm wondering whether there is a better way to do this. Could you help me? – VegetaSan Aug 27 '22 at 07:38
  • @VegetaSan Sure, that's doable. I'll add such a version in a couple of minutes. – Alexander Ivanchenko Aug 27 '22 at 07:58
  • Amazing... I have no idea how to express my gratitude for your help, so I'll just say thank you very much! I hope that someday I'll be able to help someone, just like you. – VegetaSan Aug 27 '22 at 10:06
  • Alexander Ivanchenko, thank you :) – VegetaSan Aug 27 '22 at 10:16

A while ago I made an Android app that compares two images and checks whether one is a duplicate of the other, so I ran into the same problem. After searching for a while I found an answer on Stack Overflow. I don't have the link saved, so I'm sharing the code; maybe it will give you some ideas.

public class Main {
    public static void main(String[] args) throws IOException {
        ImageChecker i = new ImageChecker();
        BufferedImage one = ImageIO.read(new File("img1.jpg"));
        BufferedImage two = ImageIO.read(new File("img2.jpg"));
        if(one.getWidth() + one.getHeight() >= two.getWidth() + two.getHeight()) {
            i.setOne(one);
            i.setTwo(two);
        } else {
            i.setOne(two);
            i.setTwo(one);
        }
        System.out.println(i.compareImages());
    }
}

public class ImageChecker {

    private BufferedImage one;
    private BufferedImage two;
    private double difference = 0;
    private int x = 0;
    private int y = 0;

    public ImageChecker() {

    }

    public boolean compareImages() {
        int f = 20;
        int w1 = Math.min(50, one.getWidth() - two.getWidth());
        int h1 = Math.min(50, one.getHeight() - two.getHeight());
        int w2 = Math.min(5, one.getWidth() - two.getWidth());
        int h2 = Math.min(5, one.getHeight() - two.getHeight());
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }

        one = one.getSubimage(Math.max(0, x - w1), Math.max(0, y - h1),
                Math.min(two.getWidth() + w1, one.getWidth() - x + w1),
                Math.min(two.getHeight() + h1, one.getHeight() - y + h1));
        x = 0;
        y = 0;
        difference = 0;
        f = 5;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        one = one.getSubimage(Math.max(0, x - w2), Math.max(0, y - h2),
                Math.min(two.getWidth() + w2, one.getWidth() - x + w2),
                Math.min(two.getHeight() + h2, one.getHeight() - y + h2));
        f = 1;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        System.out.println(difference);
        return difference < 0.1;
    }

    public void compareSubset(int a, int b, int f) {
        double diff = 0;
        for (int i = 0; i < two.getWidth(); i += f) {
            for (int j = 0; j < two.getHeight(); j += f) {
                int onepx = one.getRGB(i + a, j + b);
                int twopx = two.getRGB(i, j);
                int r1 = (onepx >> 16) & 0xff; // mask off the alpha bits
                int g1 = (onepx >> 8) & 0xff;
                int b1 = (onepx) & 0xff;
                int r2 = (twopx >> 16) & 0xff; // mask off the alpha bits
                int g2 = (twopx >> 8) & 0xff;
                int b2 = (twopx) & 0xff;
                diff += (Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1
                        - b2)) / 3.0 / 255.0;
            }
        }
        double percentDiff = diff * f * f / (two.getWidth() * two.getHeight());
        if (percentDiff < difference || difference == 0) {
            difference = percentDiff;
            x = a;
            y = b;
        }
    }

    public BufferedImage getOne() {
        return one;
    }

    public void setOne(BufferedImage one) {
        this.one = one;
    }

    public BufferedImage getTwo() {
        return two;
    }

    public void setTwo(BufferedImage two) {
        this.two = two;
    }
}

This code first orders the two images by size, because they may have different dimensions, and then compares them pixel by pixel using their RGB values and returns the result.
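The core of the pixel comparison can be reduced to the following sketch for the simpler case of two images with the same dimensions. The class and method names are hypothetical, and the sliding-window search from the code above is deliberately left out.

```java
import java.awt.image.BufferedImage;

class PixelDifference {

    // Mean per-channel difference of two equally sized images: 0.0 means identical, 1.0 is the maximum.
    static double difference(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have the same dimensions");
        }
        double sum = 0;
        for (int x = 0; x < a.getWidth(); x++) {
            for (int y = 0; y < a.getHeight(); y++) {
                int p = a.getRGB(x, y);
                int q = b.getRGB(x, y);
                // getRGB() packs a pixel as 0xAARRGGBB; shift and mask to isolate each channel
                int dr = Math.abs(((p >> 16) & 0xff) - ((q >> 16) & 0xff));
                int dg = Math.abs(((p >> 8) & 0xff) - ((q >> 8) & 0xff));
                int db = Math.abs((p & 0xff) - (q & 0xff));
                sum += (dr + dg + db) / 3.0 / 255.0;
            }
        }
        return sum / ((double) a.getWidth() * a.getHeight());
    }
}
```

A threshold such as the 0.1 used in compareImages() can then be applied to the returned value to decide whether two images count as duplicates.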

Note: all credit for the code goes to the original author, whose name I don't remember. If you are the author, please tell me and I will add your name and a link to the answer.

Rohit Bhati

Try this one. It could use some cleanup, but I think it will work for you. I took the reference from this link:

How to compare images for similarity using java

public class FileCompare {

public static void main(String[] args) {
    File directory = new File("D:\\Photos\\Test");
    List<File> filesToBeDeleted = new ArrayList<>();
    List<File> files = Arrays.asList(Objects.requireNonNull(directory.listFiles()));
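    // Compare every file with every later file, not just with its neighbour,
    // so that duplicates which are not adjacent in the listing are found too.
    for (int i = 0; i < files.size(); i++) {
        if (filesToBeDeleted.contains(files.get(i))) continue; // already marked as a duplicate
        for (int j = i + 1; j < files.size(); j++) {
            if (!filesToBeDeleted.contains(files.get(j)) && compareImage(files.get(i), files.get(j))) {
                filesToBeDeleted.add(files.get(j));
            }
        }
    }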

    filesToBeDeleted.stream().forEach(file -> {
        try {
            Files.delete(file.toPath());
        } catch (IOException e) {
            e.printStackTrace();
        }
    });

}

public static boolean compareImage(File fileA, File fileB) {
    try {

        BufferedImage biA = ImageIO.read(fileA);
        DataBuffer dbA = biA.getData().getDataBuffer();
        int sizeA = dbA.getSize();
        BufferedImage biB = ImageIO.read(fileB);
        DataBuffer dbB = biB.getData().getDataBuffer();
        int sizeB = dbB.getSize();

        if (sizeA == sizeB) {
            for (int i = 0; i < sizeA; i++) {
                if (dbA.getElem(i) != dbB.getElem(i)) {
                    return false;
                }
            }
            return true;
        } else {
            return false;
        }
    } catch (Exception e) {
        
        return false;
    }
}

}

rahulP