I want to create a Java application to identify duplicates. So far I can find duplicates only by name, but I also need size, file type, and maybe content. This is my code so far, using a HashMap:

public static void find(Map<String, List<String>> lists, File dir) {
    for (File f : dir.listFiles()) {
        if (f.isDirectory()) {
            find(lists, f);
        } else {
            String hash = f.getName() + f.length();
            List<String> list = lists.get(hash);
            if (list == null) {
                list = new LinkedList<String>();
                lists.put(hash, list);
            }
            list.add(f.getAbsolutePath());
        }
    }
}
David Harkness
Petru L
  • Maybe [this question](https://stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java) could be of help – Joakim Danielson Jul 02 '19 at 13:06
  • Have you tried introducing a FileHeader class that holds all the elements that matter to you? Based on this class, compute a checksum (SHA or MD5) and use it as the key in Collectors.groupingBy()? – Konrad Szałkowski Jul 02 '19 at 13:12

4 Answers

Considering 2 files equal if they have the same extension and the same file size is simply a matter of creating an object that represents this 'equality'. So, you'd make something like:

public class FileEquality {
    private final String fileExtension;
    private final long fileSize;

    // constructor, toString, equals, hashCode, and getters here.
}

(Fill in all the missing boilerplate: constructor, toString, equals, hashCode, and getters. Project Lombok's @Value makes this easy if you like.) You can get the file extension from a file name with fileName.lastIndexOf('.') and fileName.substring(lastIndex); note that lastIndexOf returns -1 when the name contains no dot. With Lombok, all you'd have to write is:

@lombok.Value public class FileEquality {
    String fileExtension;
    long fileSize;
}
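
For readers who would rather not add Lombok, here is a minimal hand-written sketch of the same key class, including the extension extraction described above. The helper names `extensionOf` and `of` are made up for this illustration, and lower-casing the extension (so that `.TXT` and `.txt` match) is an assumption, not something the question asked for.

```java
import java.io.File;
import java.util.Locale;
import java.util.Objects;

// Hand-written counterpart of the @Value class: immutable, with equals/hashCode over both fields.
final class FileEquality {
    private final String fileExtension;
    private final long fileSize;

    FileEquality(String fileExtension, long fileSize) {
        this.fileExtension = fileExtension;
        this.fileSize = fileSize;
    }

    // Assumption: the extension is everything from the last '.' onward, lower-cased;
    // a name without any dot yields the empty string.
    static String extensionOf(String fileName) {
        int lastIndex = fileName.lastIndexOf('.');
        return lastIndex < 0 ? "" : fileName.substring(lastIndex).toLowerCase(Locale.ROOT);
    }

    // Hypothetical convenience factory: builds the key straight from a File.
    static FileEquality of(File f) {
        return new FileEquality(extensionOf(f.getName()), f.length());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FileEquality)) return false;
        FileEquality other = (FileEquality) o;
        return fileSize == other.fileSize && fileExtension.equals(other.fileExtension);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileExtension, fileSize);
    }

    @Override
    public String toString() {
        return "FileEquality(" + fileExtension + ", " + fileSize + ")";
    }
}
```

With this in place, the map from the question would become a `Map<FileEquality, List<String>>`, keyed by `FileEquality.of(f)` instead of a concatenated string.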

Then use FileEquality objects as keys in your hashmap instead of strings. However, just because you have, say, 'foo.txt' and 'bar.txt' that both happen to be 500 bytes in size doesn't mean these 2 files are duplicates. So, you want content involved too, but, if you extend your FileEquality class to include the content of the file, then 2 things come up:

  1. If you're checking content anyway, why do the size and file extension matter? If the contents of foo.txt and bar.jpg are precisely the same, they are duplicates, no? You can represent the content as a byte[], but note that writing proper hashCode() and equals() implementations (which are required if you want to use this object as a key in a HashMap) becomes a little trickier, because arrays have to be compared with Arrays.equals and hashed with Arrays.hashCode. Fortunately, Lombok's @Value gets this right, so I suggest you use that.

  2. This implies the entirety of the file content is in your JVM's process memory. Unless you're only checking very small files, you'll just run out of memory. You can get around this by storing not the file's entire content but a hash of it. Google around for how to calculate the SHA-256 hash of a file in Java. Put this hash value in your FileEquality and you avoid the memory issue. It is theoretically possible for 2 files with different contents to hash to the exact same SHA-256 value, but the chances of that are astronomically small, and more to the point, SHA-256 is designed so that it is not computationally feasible to intentionally craft 2 such files to mess with your application. Therefore, I suggest you just trust the hash :)
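
As a rough sketch of the hashing step in point 2, the following reads a file in fixed-size chunks and feeds them to a `MessageDigest`, so only one small buffer is ever held in memory. The class name `FileHasher`, the method name `sha256Hex`, and the 8 KB buffer size are choices made for this example only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FileHasher {
    // Returns the SHA-256 digest of the file's content as a lowercase hex string.
    static String sha256Hex(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must support
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than the buffer holds, so use the returned count
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The resulting string can be stored as a field of the key class, or used directly as the map key if content is the only criterion.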

Note, of course, that hashing an entire file requires reading the entire file, so if you run your duplicate finder on a directory containing, say, 500GB worth of files, then your application will require at the very least reading of 500GB, which will take some time.
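
One way to reduce that reading cost, offered here as a suggestion rather than part of the approach above, is to group files by size first: a file whose size is unique cannot have a duplicate, so only files that share a size with another file need to be read at all. The class and method names below are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SizeGrouping {
    // Recursively groups all regular files under dir by their length in bytes.
    static void groupBySize(Map<Long, List<File>> bySize, File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return; // not a directory, or it could not be read
        }
        for (File f : files) {
            if (f.isDirectory()) {
                groupBySize(bySize, f);
            } else {
                bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
            }
        }
    }

    // Keeps only files whose size is shared with at least one other file;
    // these are the only ones worth hashing or comparing byte by byte.
    static List<File> candidates(Map<Long, List<File>> bySize) {
        List<File> result = new ArrayList<>();
        for (List<File> group : bySize.values()) {
            if (group.size() > 1) {
                result.addAll(group);
            }
        }
        return result;
    }
}
```

Reading file sizes only touches directory metadata, so this pass is cheap compared with reading the contents.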

rzwitserloot
  • If I use private static MessageDigest messageDigest; static { try { messageDigest = MessageDigest.getInstance("SHA-512"); } catch (NoSuchAlgorithmException e) { throw new RuntimeException("cannot initialize SHA-512 hash function", e); } }, is that the same thing as what you describe? – Petru L Jul 02 '19 at 14:32
  • Really like Lombok, but we can use Java records instead of `@Value` now, no? – riddle_me_this Sep 02 '23 at 23:09

I used MessageDigest, checked some files, and found the duplicates according to all the criteria I listed in the title and description. Thank you all.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

private static MessageDigest messageDigest;
static {
    try {
        messageDigest = MessageDigest.getInstance("SHA-512");
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException("cannot initialize SHA-512 hash function", e);
    }
}

And this is the duplicate search code after integrating it:

// Needs java.io.*, java.math.BigInteger, java.util.LinkedList, java.util.List and java.util.Map
public static void find(Map<String, List<String>> lists, File dir) {
    File[] files = dir.listFiles();
    if (files == null) {
        return; // not a directory, or it could not be read
    }
    for (File f : files) {
        if (f.isDirectory()) {
            find(lists, f);
        } else {
            // Stream the file through the digest in chunks, so large files do not have to fit in memory
            try (InputStream in = new FileInputStream(f)) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    messageDigest.update(buffer, 0, n);
                }
                // Create a unique hash ID for the current file
                String hash = new BigInteger(1, messageDigest.digest()).toString(16);
                List<String> list = lists.get(hash);
                if (list == null) {
                    list = new LinkedList<String>();
                }
                // Add the path to the list
                list.add(f.getAbsolutePath());
                // Put the updated list back into the hash table
                lists.put(hash, list);
            } catch (IOException e) {
                throw new RuntimeException("cannot read file " + f.getAbsolutePath(), e);
            }
        }
    }
}
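
To turn the filled map into an actual list of duplicates, something along these lines could follow the call to `find`. This is only a sketch: the class name `DuplicateReport` and its methods are invented for illustration, and it assumes the map has the same `Map<String, List<String>>` shape as above (hash to list of paths).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DuplicateReport {
    // Returns only the groups that hold more than one path, i.e. the actual duplicates.
    static List<List<String>> duplicateGroups(Map<String, List<String>> lists) {
        List<List<String>> groups = new ArrayList<>();
        for (List<String> paths : lists.values()) {
            if (paths.size() > 1) {
                groups.add(paths);
            }
        }
        return groups;
    }

    // Prints each duplicate group, one path per line.
    static void print(Map<String, List<String>> lists) {
        for (List<String> group : duplicateGroups(lists)) {
            System.out.println("Duplicates:");
            for (String path : group) {
                System.out.println("  " + path);
            }
        }
    }
}
```

Entries whose list contains a single path are files with no duplicate and are skipped.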

Petru L

I wrote an application like this long ago. Here is some of its source code, in case you want to learn from it.

This method works by comparing the bytes of both files.

public static boolean checkBinaryEquality(File file1, File file2) {
    if (file1.length() != file2.length()) return false;
    // Needs java.io.BufferedInputStream, FileInputStream, InputStream, IOException
    try (InputStream f1 = new BufferedInputStream(new FileInputStream(file1));
         InputStream f2 = new BufferedInputStream(new FileInputStream(file2))) {
        // compare the files byte by byte; the first mismatch means they are not equal
        int b1;
        while ((b1 = f1.read()) != -1) {
            if (b1 != f2.read()) return false;
        }
        // f1 is exhausted; the files match only if f2 is exhausted as well
        return f2.read() == -1;
    } catch (IOException exp) {
        // problems occurred, so let's consider them not equal
        return false;
    }
}

Combine this method with name and extension checks and you are ready to go.

Eboubaker

copy-paste-example

  1. create a class that extends File

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;
    
    public class MyFile extends File {
        private static final long serialVersionUID = 1L;
    
        public MyFile(final String pathname) {
            super(pathname);
        }
    
        @Override
        public boolean equals(final Object obj) {
            if (this == obj) {
                return true;
            }
            // guard against null before calling getClass()
            if ((obj == null) || (this.getClass() != obj.getClass())) {
                return false;
            }
            final MyFile other = (MyFile) obj;
            // cheap checks first: name and size (File.getName() never returns null)
            if (!this.getName().equals(other.getName())) {
                return false;
            }
            if (this.length() != other.length()) {
                return false;
            }
            // expensive check last: this reads both files completely
            return Arrays.equals(this.getContent(), other.getContent());
        }
    
        @Override
        public int hashCode() {
            final int prime = 31;
            int result = prime;
            result = (prime * result) + Arrays.hashCode(this.getContent());
            result = (prime * result) + ((this.getName() == null) ? 0 : this.getName().hashCode());
            result = (prime * result) + (int) (this.length() ^ (this.length() >>> 32));
            return result;
        }
    
        private byte[] getContent() {
            try (final FileInputStream fis = new FileInputStream(this)) {
                return fis.readAllBytes();
            } catch (final IOException e) {
                e.printStackTrace();
                return new byte[] {};
            }
        }
    }
    
  2. read base directory

    import java.io.File;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.Vector;
    
    public class FileTest {
        public FileTest() {
            super();
        }
    
        public static void main(final String[] args) {
            final Map<MyFile, List<MyFile>> duplicates = new HashMap<>();
            FileTest.handleDirectory(duplicates, new File("[path to base directory]"));
            final Iterator<Entry<MyFile, List<MyFile>>> iterator = duplicates.entrySet().iterator();
            while (iterator.hasNext()) {
                final Entry<MyFile, List<MyFile>> next = iterator.next();
                if (next.getValue().size() == 0) {
                    iterator.remove();
                } else {
                    System.out.println(next.getKey().getName() + " - " + next.getKey().getAbsolutePath());
                    for (final MyFile file : next.getValue()) {
                        System.out.println("        ->" + file.getName() + " - " + file.getAbsolutePath());
                    }
                }
            }
        }
    
        private static void handleDirectory(final Map<MyFile, List<MyFile>> duplicates, final File directory) {
            final File dir = directory;
            if (dir.isDirectory()) {
                final File[] files = dir.listFiles();
                if (files == null) {
                    return; // listFiles() returns null if the directory cannot be read
                }
                for (final File file : files) {
                    if (file.isDirectory()) {
                        FileTest.handleDirectory(duplicates, file);
                        continue;
                    }
                    final MyFile myFile = new MyFile(file.getAbsolutePath());
                    if (!duplicates.containsKey(myFile)) {
                        duplicates.put(myFile, new Vector<>());
                    } else {
                        duplicates.get(myFile).add(myFile);
                    }
                }
            }
        }
    }