
Here is the code I wrote, with some help of course. There are some bugs in the logic that I can't spot. I am pretty new to programming, and a little help wouldn't hurt.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Directories {

    public static void main(String[] args) {
        Path currentDir = Paths.get("/root"); // some directory
        List<Path> uniqueFiles = new ArrayList<Path>();
        List<Path> duplicates = new ArrayList<Path>();
        displayDirectoryContents(currentDir, uniqueFiles, duplicates);
        for (Path dup : duplicates) {
            System.out.println("duplicate: " + dup);
        }
    }

    // The lists are parameters so that one pair of lists is shared across all
    // recursive calls; creating them inside this method would reset them for
    // every subdirectory.
    public static void displayDirectoryContents(Path dir, List<Path> uniqueFiles,
            List<Path> duplicates) {
        // try-with-resources closes the DirectoryStream
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    displayDirectoryContents(entry, uniqueFiles, duplicates);
                } else {
                    // Decide once per entry, after comparing against all files seen
                    // so far; adding to uniqueFiles while iterating over it would
                    // throw a ConcurrentModificationException.
                    boolean duplicated = false;
                    for (Path alreadySeen : uniqueFiles) {
                        if (isDuplicated(entry, alreadySeen)) {
                            duplicated = true;
                            break;
                        }
                    }
                    if (duplicated) {
                        duplicates.add(entry);
                    } else {
                        uniqueFiles.add(entry);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static boolean isDuplicated(Path first, Path second) {
        try {
            return Files.size(first) == Files.size(second)
                    && Arrays.equals(Files.readAllBytes(first), Files.readAllBytes(second));
        } catch (IOException e) {
            e.printStackTrace();
        }
        return false;
    }
}

I would really appreciate some help. Thank you

birkoff
  • Place all file names in a HashMap: key = SHA-1 hash of the file content, value = String containing the file name and path. Files with the same content have the same SHA-1 hash and will therefore only be saved once in the HashMap (see the sketch after these comments). – Robert Feb 17 '15 at 13:35
  • What about keeping a list or a Map of the files you retrieve on your way, and each time you go through a file you check in the list for its existence? – lateralus Feb 17 '15 at 13:35
  • @Robert made a great point with the HashMap of file hashes. This is a great way to check for duplicate content. Incorporate this into your solution. – Erick Robertson Feb 21 '15 at 14:26
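A minimal sketch of the hash-map idea from these comments, assuming SHA-1 via java.security.MessageDigest; the class name HashIndex and its methods are illustrative, not from the question:

import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class HashIndex {

    private final Map<String, String> index = new HashMap<String, String>();

    // Hash the file content; files with equal content always produce the same SHA-1 digest.
    static String sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[4096];
        int read;
        try (InputStream is = Files.newInputStream(file)) {
            while ((read = is.read(buffer)) > 0) {
                digest.update(buffer, 0, read);
            }
        }
        // hex-encode the digest so it can serve as a HashMap key
        return new BigInteger(1, digest.digest()).toString(16);
    }

    // Returns true if a file with the same content was seen before;
    // otherwise records this file as the first one with that hash.
    boolean isDuplicate(Path file) throws IOException, NoSuchAlgorithmException {
        String key = sha1(file);
        if (index.containsKey(key)) {
            return true;
        }
        index.put(key, file.toString());
        return false;
    }
}

With this, each file is read and hashed exactly once, instead of being compared byte-by-byte against every unique file seen so far.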

2 Answers


Here is a solution based on Java 8, using Files.find():

public static List<Path> listDups(final Path baseDir)
    throws IOException
{
    final BiPredicate<Path, BasicFileAttributes> filesOnly
        = (path, attrs) -> attrs.isRegularFile();

    final List<Path> uniqueFiles = new ArrayList<>();
    final List<Path> dups = new ArrayList<>();

    // Files.find() requires a maximum depth; MAX_VALUE means "no limit",
    // so subdirectories are traversed as well
    try (
        final Stream<Path> stream
            = Files.find(baseDir, Integer.MAX_VALUE, filesOnly);
    ) {
        stream.forEach(path -> {
            final boolean alreadyFound = uniqueFiles.stream()
                .anyMatch(found -> sameContent(path, found));
            final List<Path> list = alreadyFound ? dups : uniqueFiles;
            list.add(path);
        });
    }

    return dups;
}

private static boolean sameContent(final Path first, final Path second)
{
    // sameContent() is called from a lambda, which cannot throw checked
    // exceptions, so the IOException is wrapped
    try {
        return Files.size(first) == Files.size(second)
            && Arrays.equals(Files.readAllBytes(first), Files.readAllBytes(second));
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

Not ideal, however; you might want to replace the Arrays.equals() with sequential reading from input streams of both files (see the sketch below).

But that is a proof of concept.
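A minimal sketch of that sequential comparison; the name sameContentStreaming is illustrative, and it needs java.io.BufferedInputStream and java.io.InputStream:

private static boolean sameContentStreaming(final Path first, final Path second)
    throws IOException
{
    // files of different sizes cannot have the same content
    if (Files.size(first) != Files.size(second))
        return false;
    try (
        final InputStream in1 = new BufferedInputStream(Files.newInputStream(first));
        final InputStream in2 = new BufferedInputStream(Files.newInputStream(second));
    ) {
        int b;
        while ((b = in1.read()) != -1)
            if (b != in2.read())
                return false; // stop at the first differing byte
        // sizes are equal, so both streams reach EOF together
        return true;
    }
}

Unlike readAllBytes(), this never holds more than two small buffers in memory and stops as soon as the files diverge.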

fge
  • Do I need to enter 2 paths for comparing (or the same one)? Because the code should look like "listDuplicatingItems(dir)" – birkoff Feb 17 '15 at 14:01
  • @birkoff and you don't care "which comes first"? That is, if `a` and `b` are duplicates but `b` comes before `a` in the stream, is that OK if `a` is marked as a duplicate and not `b`? – fge Feb 17 '15 at 14:20
  • Well then, adapting the code is easy, right? Create a second list, and if two files are the same, add to that list before you `continue` and return that dup list instead – fge Feb 17 '15 at 14:32
  • Dude, I am really, really thankful to you. You did great work, but did you test it with some folder? Because I am not getting anything from the lists, neither one of them. – birkoff Feb 17 '15 at 15:14
  • Ok, I saw what happens, it doesn't go into a subdirectory – birkoff Feb 17 '15 at 15:18
  • You never said anything about traversing subdirectories, now, did you? And by the way, if you use Java 8, the code would be different – fge Feb 17 '15 at 15:25
  • I think I said it: "My task is to walk through a directory recursively, get (return as a list) all the files from the directory and subdirectories". I am new to Java, I don't know Java 8; I am learning Java 6-7 at the moment (at school). Can you PLEASE help me fix it? I would be really grateful (if you have time of course) – birkoff Feb 17 '15 at 15:33
  • Well, I can but first you do have to tell me whether you intend to use Java 7 or Java 8; the code will be quite different. – fge Feb 17 '15 at 15:43
  • Well, it really doesn't matter. :) I have a task from the university. I am new to Java; I have some knowledge, but not that much. I can walk through all the folders and subfolders and get all the files, but I don't have the knowledge to exclude the duplicate files by content. It's very important to me. If I don't post a solution, I may fail. :) I recommend the easiest way – birkoff Feb 17 '15 at 16:51
  • Please tell me you tested the code, because it gives me an error before the "return dups;"; you forgot some ") or ;" somewhere and I can't find where. You also forgot to declare sameContent as a boolean. Did you test it? – birkoff Feb 18 '15 at 12:42
  • No, I wrote the code as is. But you know, you can fix the bugs yourself ;) – fge Feb 18 '15 at 16:43

The questions that should be asked are:

  1. How do you want to check for "duplicated by content"? Do you mean comparing the files byte by byte? What if there are two files of 10 GB each? What if there are plenty of such files?
  2. Suppose two files are equal by content. Which one do you want to include in the list?

In this answer I assumed the following:

  1. MD5 is used to check files for similarity. In this question you can see how one can compute it in Java.
  2. You don't care which of the equal files is dropped; you just need to exclude any duplicates. (This is done by the (f1, f2) -> f1 merge function in the code.)

    static byte[] md5(Path file) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            int read;
            byte[] buffer = new byte[4096];
            try (InputStream is = new FileInputStream(file.toFile())) {
                while ((read = is.read(buffer)) > 0) {
                    digest.update(buffer, 0, read);
                }
            }
            return digest.digest();
        } catch (IOException | NoSuchAlgorithmException ex) {
            //handle it or
            throw new RuntimeException(ex);
        }
    }
    public static void main(String[] args) throws IOException {
        System.out.println("first attempt:");
        Files.list(Paths.get("/tmp/t")).forEach(System.out::println);
        System.out.println("second attempt:");
        Files.list(Paths.get("/tmp/t"))
            .collect(Collectors.toMap(f -> new BigInteger(md5(f)), f -> f, (f1, f2) -> f1))
            .values()
            .forEach(System.out::println);
    }
    

Description: let's list all the files we want to check and calculate the MD5 sum of each one. Then put all the (md5, file) pairs into a map. By definition, the map keeps only one value (file) per key (md5). The MD5 values of two files are equal whenever the files are equal by content, and a situation where two files that differ in content have the same MD5 value is extremely unlikely. So the resulting map values are the unique files.

I created the folder /tmp/t/ and files in it: 1 and 3 are equal, but 2 is different. Output:

first attempt:
/tmp/t/2
/tmp/t/1
/tmp/t/3
second attempt:
/tmp/t/1
/tmp/t/2

The code I posted here only lists the contents of a single directory. You can extend it to your use case using Files.walkFileTree or a similar approach, as sketched below.
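For instance, a minimal sketch assuming Java 8's Files.walk(), reusing the md5() helper above (it also needs java.util.stream.Stream and java.util.stream.Collectors):

    // Files.walk() visits subdirectories too, so the same toMap() collector
    // now removes duplicates across the whole tree instead of one directory
    try (Stream<Path> files = Files.walk(Paths.get("/tmp/t"))) {
        files.filter(Files::isRegularFile)
             .collect(Collectors.toMap(f -> new BigInteger(md5(f)), f -> f, (f1, f2) -> f1))
             .values()
             .forEach(System.out::println);
    }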

Sergey Fedorov
  • Note that there is no guarantee that two files with different contents have a different checksum; what is more, computing a checksum means reading the whole content of the file, which may not be necessary – fge Feb 17 '15 at 15:44
  • Sure, a checksum doesn't give such a guarantee. However, it avoids both 1) reading a file for every comparison and 2) storing file content in memory. Moreover, collisions are rare, so the method can be considered valid, if it fits the OP's requirements. – Sergey Fedorov Feb 17 '15 at 15:49
  • I don't know why, but it highlights the "new BigInteger" part. And I can't compile it. – birkoff Feb 21 '15 at 19:26
  • what is `java -version` output? – Sergey Fedorov Feb 24 '15 at 13:31