
As part of a Java-based web app, I'm going to be accepting uploaded .xls and .csv (and possibly other types of) files. Each file will be uniquely renamed with a combination of parameters and a timestamp.

I'd like to be able to identify any duplicate files. By duplicate I mean the exact same file, regardless of name. Ideally, I'd like to detect duplicates as quickly as possible after the upload, so that the server can include this info in the response. (Assuming the processing time for larger files doesn't cause too much of a lag.)

I've read about running MD5 on the files and storing the result as a unique key, etc., but I suspect there's a much better way. (Is there a better way?)

Any advice on how best to approach this is appreciated.

Thanks.

UPDATE: I have nothing at all against using MD5. I've used it a few times in the past with Perl (Digest::MD5). I thought that in the Java world another (better) solution might have emerged, but it looks like I was mistaken.

Thank you all for the answers and comments. I'm feeling pretty good about using MD5 now.

S.Jones
  • As you noted, storing a hash of the file is a good solution. – Kirk Woll Sep 15 '10 at 20:45
  • Calculating a hash was what I was going to suggest. Why don't you think it's a good approach? – Jack Leow Sep 15 '10 at 20:56
  • MD5/SHA is a great solution. If you're ultra worried about preventing false negatives (by which I mean improperly declaring the file as a duplicate) you could start by comparing by digest and then if they're a match compare byte-by-byte. False negatives would really only happen as the result of deliberate malicious attempts at collisions though. – Mark Peters Sep 15 '10 at 20:58
  • By the way, isn't this the entire philosophy behind certain versioning systems like git? If it's good enough for git, it's good enough for me. – Mark Peters Sep 15 '10 at 21:00
  • @Maurice "What's wrong with md5?" - nothing at all. I've used it a few times in the past (in a slightly different capacity) with Perl. I thought that in the Java world another (better) solution might have emerged. But based on the responses so far... it clearly looks like I was mistaken. – S.Jones Sep 16 '10 at 00:06
  • @Mark I wasn't aware that this was the approach behind Git. That's very good to know - Thanks. – S.Jones Sep 16 '10 at 00:11
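Mark Peters' digest-then-verify suggestion can be sketched as follows. This is a minimal illustration, not code from the thread; the class and method names are mine, and it assumes Java 12+ for `Files.mismatch`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DuplicateCheck {
    // Hash a file's contents; the algorithm name ("MD5", "SHA-256") is your choice.
    static byte[] digestOf(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        return md.digest(Files.readAllBytes(file));
    }

    // Cheap checks first (size, then digest); byte-by-byte only on a digest match.
    static boolean sameContents(Path a, Path b) throws Exception {
        if (Files.size(a) != Files.size(b)) return false;
        if (!Arrays.equals(digestOf(a, "MD5"), digestOf(b, "MD5"))) return false;
        // Digests match; rule out a (deliberate) collision with a full compare.
        return Files.mismatch(a, b) == -1L; // Java 12+
    }
}
```

In practice you'd store the digests in a database and only fall through to the byte-by-byte comparison on the rare digest hit.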

2 Answers


While processing uploaded files, decorate the OutputStream with a DigestOutputStream so that you can calculate the digest of the file while writing. Store the final digest somewhere along with the unique identifier of the file (in hex, perhaps as part of the filename?).
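A minimal sketch of that idea, assuming the upload arrives as an InputStream (the class name, target path handling, and buffer size are placeholders, not part of the answer):

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;

public class UploadHasher {
    // Copies the upload to disk; the digest is computed as a side effect of writing,
    // so no second pass over the file is needed.
    static byte[] saveAndDigest(InputStream upload, String targetPath) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (OutputStream out = new DigestOutputStream(new FileOutputStream(targetPath), md)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = upload.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return md.digest();
    }
}
```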

BalusC

You only need to add a method like this to your code and you're done; there's probably no better way, since all the work is already done by the Digest API.

// Requires: java.io.InputStream, java.math.BigInteger, java.security.MessageDigest
public static String calc(InputStream is) {
    byte[] buffer = new byte[8192];
    int read;

    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256"); // or "MD5"
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }
        byte[] hash = digest.digest();
        // Pad to the full digest width: BigInteger.toString(16) drops leading zeros,
        // which would produce shorter keys for some files.
        return String.format("%0" + (hash.length * 2) + "x", new BigInteger(1, hash));
    }
    catch (Exception e) {
        e.printStackTrace(System.err);
        return null;
    }
}
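To connect this back to the duplicate check in the question, here is a hypothetical end-to-end usage (the method body is repeated so the example compiles standalone; the file names and the in-memory map standing in for a database digest index are assumptions):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DuplicateDemo {
    // Same idea as the calc() method above, reproduced for a self-contained example.
    static String calc(InputStream is) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }
        byte[] hash = digest.digest();
        return String.format("%0" + (hash.length * 2) + "x", new BigInteger(1, hash));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a database table with a unique index on the digest column.
        Map<String, String> seen = new HashMap<>();
        byte[] upload1 = "a,b\n1,2\n".getBytes();
        byte[] upload2 = "a,b\n1,2\n".getBytes(); // same bytes, different "file name"
        seen.put(calc(new ByteArrayInputStream(upload1)), "report_20100915.csv");
        // A key hit on the second upload means it's a duplicate, whatever its name.
        System.out.println(seen.containsKey(calc(new ByteArrayInputStream(upload2)))); // true
    }
}
```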
stacker