
I have to fingerprint files to detect duplicates. What is recommended with Java in 2013? Should I also compare the file sizes, or is this an unnecessary check?

The probability of a false positive should be very close to 0.

EDIT: Lots of answers, thanks. What is the standard for backup software today? SHA-256? Higher? I guess MD5 is not suitable?

Stig
  • 128-bit or 256-bit hash are usually good for normal usage. You can also compare file size to put the files in different buckets, then only hash if there are more than 2 files with same size. – nhahtdh Mar 15 '13 at 20:17
  • Are third-party libraries permissible? Guava, at least, has features to make this significantly easier than what you'd have to do in pure Java. – Louis Wasserman Mar 15 '13 at 20:18
  • MD5 is perfectly suitable for this. Note that MD5 cannot be considered to be a **secure** hash algorithm anymore. You have to determine for your situation if that's important. (Is there a possibility that a hacker might try to fool your software into thinking that two files are the same?). – Jesper Mar 15 '13 at 20:26
  • @Jesper: If attacker has control to the 2 files, then it doesn't matter if it is MD5 or SHA1. If attacker has control to 1 file, it is still hard to find second preimage attack. – nhahtdh Mar 15 '13 at 20:50

2 Answers


If the probability of false positives has to be zero, as opposed to "lower than the probability you will be struck by lightning," then no hash algorithm at all can be used; you must compare the files byte by byte.

For what it's worth, if you can use third-party libraries, you can use Guava to compare two files byte-by-byte with the one-liner

Files.asByteSource(file1).contentEquals(Files.asByteSource(file2));

which takes care of opening and closing the files as well as the details of comparison.
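If third-party libraries aren't an option, the same byte-by-byte comparison can be written in plain Java. This is an illustrative sketch (the class and method names are made up for the example), not Guava's actual implementation:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteCompare {
    // Returns true only if both files have identical contents.
    static boolean sameContents(File f1, File f2) throws IOException {
        if (f1.length() != f2.length()) {
            return false; // files of different sizes can never match
        }
        try (InputStream in1 = new BufferedInputStream(new FileInputStream(f1));
             InputStream in2 = new BufferedInputStream(new FileInputStream(f2))) {
            int b;
            while ((b = in1.read()) != -1) {
                if (b != in2.read()) {
                    return false; // stop at the first mismatching byte
                }
            }
            return true;
        }
    }
}
```

Buffering matters here: without `BufferedInputStream`, the single-byte `read()` calls would each hit the file system.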

If you're willing to accept false positives that are less likely than getting struck by lightning, then you could do

Files.hash(file, Hashing.sha1()); // or md5(), or sha256(), or...

which returns a HashCode, and then you can test that for equality with the hash of another file. (That version also deals with the messiness of MessageDigest, of opening and closing the file properly, etcetera.)
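To wire this into duplicate detection over many files, the strategy from the comments (bucket by size first, hash only buckets with more than one candidate) can be sketched in plain JDK code. The class name and helpers below are hypothetical, and SHA-256 is just one reasonable choice of algorithm:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFinder {
    // Hex-encoded SHA-256 of a file's contents.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Groups paths by hash; every returned group holds probable duplicates.
    static Map<String, List<Path>> findDuplicates(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        // First bucket by size -- cheap, reads no file contents.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            long size = Files.size(p);
            List<Path> bucket = bySize.get(size);
            if (bucket == null) {
                bucket = new ArrayList<>();
                bySize.put(size, bucket);
            }
            bucket.add(p);
        }
        // Only hash buckets that contain more than one candidate.
        Map<String, List<Path>> byHash = new HashMap<>();
        for (List<Path> bucket : bySize.values()) {
            if (bucket.size() < 2) continue;
            for (Path p : bucket) {
                String hash = sha256(p);
                List<Path> group = byHash.get(hash);
                if (group == null) {
                    group = new ArrayList<>();
                    byHash.put(hash, group);
                }
                group.add(p);
            }
        }
        return byHash;
    }
}
```

The size pass costs only a metadata lookup per file, so unique-sized files are never read at all; hashing is reserved for genuine candidates.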

Louis Wasserman
  • MD5 is OK, or even 64-bit hash is good enough for most purpose. The chance of collision is extremely low at the level of practical (non-secure) use: http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table – nhahtdh Mar 15 '13 at 20:55
  • I think you want the method now "toByteArray" – Setheron Oct 15 '13 at 20:29
  • @Setheron: why do you say that? None of these operations require `toByteArray`. (Also, they should work even if the file is too big to fit into RAM.) – Louis Wasserman Oct 15 '13 at 20:41
  • @LouisWasserman My bad, for some reason the version of Guava I'm using didn't have asByteSOurce. I ended up doing "Files.equal(originalWallet, modifiedWallet) " – Setheron Oct 16 '13 at 18:49

Are you asking how to get the MD5 checksums of files in Java? If that's the case, then read the accepted answers here and here. Basically, do this:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
...

MessageDigest md_1 = MessageDigest.getInstance("MD5");
MessageDigest md_2 = MessageDigest.getInstance("MD5");
InputStream is_1 = new DigestInputStream(new FileInputStream("file1.txt"), md_1);
InputStream is_2 = new DigestInputStream(new FileInputStream("file2.txt"), md_2);
try {
  // A DigestInputStream only updates its digest as bytes are read,
  // so both streams must be read to the end.
  byte[] buf = new byte[8192];
  while (is_1.read(buf) != -1) { }
  while (is_2.read(buf) != -1) { }
}
finally {
  is_1.close();
  is_2.close();
}
byte[] digest_1 = md_1.digest();
byte[] digest_2 = md_2.digest();

// compare digest_1 and digest_2, e.g. with MessageDigest.isEqual(digest_1, digest_2)

Should I also compare the file sizes, or is this an unnecessary check?

It is unnecessary.
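Unnecessary for correctness, that is: a size comparison can still pay off as a cheap short-circuit before reading any file contents. A minimal sketch combining both ideas (the class and method names are hypothetical):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Compare {
    // True if both files hash to the same MD5 digest.
    static boolean sameMd5(File f1, File f2)
            throws IOException, NoSuchAlgorithmException {
        // Cheap short-circuit: different sizes can never be duplicates,
        // and checking length avoids reading either file.
        if (f1.length() != f2.length()) {
            return false;
        }
        return MessageDigest.isEqual(md5(f1), md5(f2));
    }

    static byte[] md5(File f) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(new FileInputStream(f), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* digest updates as we read */ }
        }
        return md.digest();
    }
}
```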

Barney
  • is md5 for comparing files considered safe? – Stig Mar 15 '13 at 20:26
  • SHA is considered more secure than MD5. However, the probability of two different files sharing the same MD5 checksum is almost zero. – Barney Mar 15 '13 at 20:31
  • 1
    Again, comparing file size is unnecessary, but I think it is better to do, since we can skip expensive disk read to compute hash. – nhahtdh Mar 15 '13 at 20:58