Considering 2 files equal if they have the same extension and the same file size is simply a matter of creating an object that represents this 'equality'. So, you'd make something like:
```java
public class FileEquality {
    private final String fileExtension;
    private final long fileSize;
    // constructor, toString, equals, hashCode, and getters here.
}
```
(and fill in all the missing boilerplate: constructor, toString, equals, hashCode, and getters. See Project Lombok's `@Value` to make this easy if you like). You can get the file extension from a file name using `fileName.lastIndexOf('.')` and `fileName.substring(lastIndex)`. With Lombok, all you'd have to write is:
```java
@lombok.Value
public class FileEquality {
    String fileExtension;
    long fileSize;
}
```
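For illustration, the extension extraction described above could live in a small helper. This is just a sketch; the class and method names (`FileNames`, `extensionOf`) are my own, and here I skip past the last dot so the extension comes back without it:

```java
public final class FileNames {

    // Returns the extension after the last '.', or "" if there is no dot.
    static String extensionOf(String fileName) {
        int lastIndex = fileName.lastIndexOf('.');
        return lastIndex == -1 ? "" : fileName.substring(lastIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("foo.txt"));        // txt
        System.out.println(extensionOf("archive.tar.gz")); // gz
        System.out.println(extensionOf("README"));         // (empty string)
    }
}
```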
Then use `FileEquality` objects as keys in your hashmap instead of strings. However, just because you have, say, 'foo.txt' and 'bar.txt' that both happen to be 500 bytes in size doesn't mean these 2 files are duplicates. So, you want content involved too. But if you extend your `FileEquality` class to include the content of the file, then 2 things come up:
1. If you're checking content anyway, what do the size and file extension matter? If the content of `foo.txt` and `bar.jpg` is precisely the same, they are duplicates, no? Why bother? You can convey the content as a `byte[]`, but note that writing a proper `hashCode()` and `equals()` implementation (which are required if you want to use this object as a key for hashmaps) becomes a little trickier. Fortunately, Lombok's `@Value` will get it right, so I suggest you use that.
2. This implies the entirety of the file content is in your JVM's process memory. Unless you're only checking very small files, you'll just run out of memory. You can abstract this away somewhat by storing not the file's entire content but a hash of it. Google around for how to calculate the sha-256 hash of a file in Java. Put this hash value in your `FileEquality` and now you avoid the memory issue. It is theoretically possible for 2 files with different contents to hash to the exact same sha-256 value, but the chances of that are astronomical. More to the point, sha-256 is designed such that it is not mathematically feasible to intentionally craft 2 such files to mess with your application. Therefore, I suggest you just trust the hash :)
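As a sketch of the hashing step (assuming Java 17+ for `java.util.HexFormat`; the helper name `sha256Of` is mine), you can use the JDK's `MessageDigest`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class FileHashing {

    // Reads the whole file and returns its sha-256 hash as a hex string.
    // (For very large files you'd stream via DigestInputStream instead
    // of reading everything into memory at once.)
    static String sha256Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        return HexFormat.of().formatHex(hash);
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.writeString(tmp, "hello");
        System.out.println(sha256Of(tmp));
        // prints 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
    }
}
```

The hex string (rather than the raw `byte[]`) is convenient as a field in `FileEquality`, since `String` already has sane `equals` and `hashCode`.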
Note, of course, that hashing an entire file requires reading the entire file, so if you run your duplicate finder on a directory containing, say, 500GB worth of files, your application will have to read at least 500GB, which will take some time.
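Putting the pieces together, a duplicate finder along these lines might walk a directory and group files by content hash in a map (a hypothetical sketch, assuming Java 17+; the names `DuplicateFinder` and `groupByContentHash` are my own):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public final class DuplicateFinder {

    // Maps each content hash to the list of files that have that content.
    // Any list with more than one entry is a group of duplicates.
    static Map<String, List<Path>> groupByContentHash(Path root) throws Exception {
        Map<String, List<Path>> groups = new HashMap<>();
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).toList();
        }
        for (Path p : files) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String hash = HexFormat.of().formatHex(md.digest(Files.readAllBytes(p)));
            groups.computeIfAbsent(hash, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("dups");
        Files.writeString(dir.resolve("a.txt"), "same content");
        Files.writeString(dir.resolve("b.txt"), "same content");
        Files.writeString(dir.resolve("c.txt"), "different content");
        groupByContentHash(dir).values().stream()
                .filter(group -> group.size() > 1)
                .forEach(group -> System.out.println("Duplicates: " + group));
    }
}
```

If you still want the size/extension pre-filter, you'd first group by a `FileEquality` key (cheap: no file reads) and only hash within groups that have more than one member.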