
As part of a Java-based web app, I'm going to be accepting uploaded .xls and .csv (and possibly other types of) files. Each file will be uniquely renamed with a combination of parameters and a timestamp.

I'd like to be able to identify any duplicate files. By duplicate I mean the exact same file, regardless of name. Ideally, I'd like to detect duplicates as quickly as possible after the upload, so that the server can include this info in the response. (Assuming the processing time for larger files doesn't cause too much of a lag.)

I've read about running MD5 on the files and storing the result as a unique key, etc., but I suspect there's a much better way. (Is there a better way?)

Any advice on how best to approach this is appreciated.

Thanks.

UPDATE: I have nothing at all against using MD5. I've used it a few times in the past with Perl (Digest::MD5). I thought that in the Java world another (better) solution might have emerged, but it looks like I was mistaken.

Thank you all for the answers and comments. I'm feeling pretty good about using MD5 now.

S.Jones
  • As you noted, storing a hash of the file is a good solution. – Kirk Woll Sep 15 '10 at 20:45
  • Calculating a hash was what I was going to suggest. Why don't you think it's a good approach? – Jack Leow Sep 15 '10 at 20:56
  • MD5/SHA is a great solution. If you're ultra worried about preventing false negatives (by which I mean improperly declaring the file as a duplicate) you could start by comparing by digest and then if they're a match compare byte-by-byte. False negatives would really only happen as the result of deliberate malicious attempts at collisions though. – Mark Peters Sep 15 '10 at 20:58
  • By the way, isn't this the entire philosophy behind certain versioning systems like git? If it's good enough for git, it's good enough for me. – Mark Peters Sep 15 '10 at 21:00
  • @Maurice "What's wrong with md5?" - nothing at all. I've used it a few times in the past (in a slightly different capacity) with Perl. I thought that in the Java world another (better) solution might have emerged. But based on the responses so far... it clearly looks like I was mistaken. – S.Jones Sep 16 '10 at 00:06
  • @Mark I wasn't aware that this was the approach behind Git. That's very good to know - Thanks. – S.Jones Sep 16 '10 at 00:11
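Mark Peters' digest-then-verify suggestion can be sketched as follows. This is a minimal illustration, not code from the thread; the class and method names are mine, and it assumes Java 12+ for `Files.mismatch`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DuplicateCheck {
    // Hash a file's contents; the algorithm name ("MD5", "SHA-256") is your choice.
    static byte[] digestOf(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        return md.digest(Files.readAllBytes(file));
    }

    // Cheap checks first (size, then digest); byte-by-byte only on a digest match.
    static boolean sameContents(Path a, Path b) throws Exception {
        if (Files.size(a) != Files.size(b)) return false;
        if (!Arrays.equals(digestOf(a, "MD5"), digestOf(b, "MD5"))) return false;
        // Digests match; rule out a (deliberate) collision with a full compare.
        return Files.mismatch(a, b) == -1L; // Java 12+
    }
}
```

In practice you'd store the digests in a database and only fall through to the byte-by-byte comparison on the rare digest hit.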

2 Answers


While processing uploaded files, decorate the OutputStream with a DigestOutputStream so that you can calculate the digest of the file while writing. Store the final digest somewhere along with the unique identifier of the file (in hex, perhaps as part of the filename?).
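A minimal sketch of that idea, assuming the upload arrives as an InputStream (the class name, target path handling, and buffer size are placeholders, not part of the answer):

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;

public class UploadHasher {
    // Copies the upload to disk; the digest is computed as a side effect of writing,
    // so no second pass over the file is needed.
    static byte[] saveAndDigest(InputStream upload, String targetPath) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (OutputStream out = new DigestOutputStream(new FileOutputStream(targetPath), md)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = upload.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return md.digest();
    }
}
```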

BalusC

You only need to add a method like this to your code and you're done; there's probably no better way, since all the work is already done by the Digest API.

// Requires: java.io.InputStream, java.math.BigInteger, java.security.MessageDigest
public static String calc(InputStream is) {
    byte[] buffer = new byte[8192];
    int read;

    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256"); // or "MD5"
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }
        byte[] hash = digest.digest();
        // Pad to the full digest width: BigInteger.toString(16) drops leading zeros,
        // which would produce shorter keys for some files.
        return String.format("%0" + (hash.length * 2) + "x", new BigInteger(1, hash));
    }
    catch (Exception e) {
        e.printStackTrace(System.err);
        return null;
    }
}
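To connect this back to the duplicate check in the question, here is a hypothetical end-to-end usage (the method body is repeated so the example compiles standalone; the file names and the in-memory map standing in for a database digest index are assumptions):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DuplicateDemo {
    // Same idea as the calc() method above, reproduced for a self-contained example.
    static String calc(InputStream is) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }
        byte[] hash = digest.digest();
        return String.format("%0" + (hash.length * 2) + "x", new BigInteger(1, hash));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a database table with a unique index on the digest column.
        Map<String, String> seen = new HashMap<>();
        byte[] upload1 = "a,b\n1,2\n".getBytes();
        byte[] upload2 = "a,b\n1,2\n".getBytes(); // same bytes, different "file name"
        seen.put(calc(new ByteArrayInputStream(upload1)), "report_20100915.csv");
        // A key hit on the second upload means it's a duplicate, whatever its name.
        System.out.println(seen.containsKey(calc(new ByteArrayInputStream(upload2)))); // true
    }
}
```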
stacker