0

I was storing some files based on a checksum, but I found a flaw: two checksums can sometimes be identical.

I always try to find an existing API instead of reinventing the wheel, but I can't find anything for this.

I know there's JSR 283 (the Java Content Repository standard) and Jackrabbit for content storage, but my app is light-years away from using something like that.

So, are there established approaches to single-instance file storage in Java, or should I just keep searching for a better algorithm for my checksum?

EDIT:

Where the checksum breaks down: two files are exactly the same, just in different file-system locations. When they are sent from the client, the server has no way of knowing which path each one came from, so it receives the same file twice, with the same checksum.

If you want to retrieve either one later, how do you tell them apart?

I wanted to know if there is a standard approach, an API, or an algorithm that could help me spot the difference.

javaNoober
  • Use an MD5 or SHA-1 hash. Then they won't be duplicated. – bmargulies Jul 27 '11 at 00:40
  • To give you an idea of what not to worry about... SHA-1 is used by [git](http://git-scm.com/) to identify files ("blobs"). It's unable to distinguish between two files with the same SHA-1 signature - they are assumed to be the same file, and the difference is lost. As far as I know, no one has claimed to have lost any files. – Ed Staub Jul 27 '11 at 00:45
  • @Ed Staub I wrote a script to rename the files in a massive porn picture collection to their SHA-1 hashes (to remove duplicates) and found several collisions. It is unlikely, but it can happen. – Christopher Jul 27 '11 at 02:30
  • Yes, it can happen; that's my current problem, actually. The worst part is I'm storing the hash as the primary key in a table for future references... SQL error for the win... – javaNoober Jul 27 '11 at 06:21
  • 1
    @Christopher: with a million pictures, chances are 1 part in 10^37th (10^12 / 2^160) that there would be a single collision. With a billion pictures, chances would be less than 1 part in 10^31st. With a trillion, less than one part in 10^25. So either that's a very impressive collection, or, more likely, there was a bug in the application or implementation of the SHA-1 algorithm. Even if you allow a large factor for imagined weaknesses in the hash, collisions are still wildly unlikely. – Ed Staub Jul 27 '11 at 11:51

2 Answers

3

No matter how strong a hashing algorithm is, there is always a chance of a collision: a hashing algorithm maps an effectively infinite number of possible inputs onto a finite number of hash values, so by the pigeonhole principle some distinct inputs must share a hash.
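As an illustration, here is a minimal sketch of computing such a hash in Java with the standard `java.security.MessageDigest` API (the `FileHasher` class and `sha256Hex` method names are just examples; SHA-256's 256-bit output makes accidental collisions astronomically unlikely, though never impossible):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {

    // Streams a file through SHA-256 and returns the digest as a hex string.
    // Even a 256-bit hash only shrinks the collision odds; it cannot
    // eliminate them, since infinitely many inputs map to 2^256 outputs.
    public static String sha256Hex(String path)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```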

Jeffrey
0

The only way to be certain whether two files are identical is to compare them bit by bit. Hashing them is easier and faster, but it carries the risk of collision.
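A minimal sketch of that byte-by-byte check, using plain `java.io` streams (the `FileComparator` class and `contentsEqual` method are hypothetical names, not a library API):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileComparator {

    // Returns true only if both files have exactly the same contents.
    // A mismatching byte, or one stream ending before the other,
    // means the files differ.
    public static boolean contentsEqual(String pathA, String pathB)
            throws IOException {
        try (InputStream a = new BufferedInputStream(new FileInputStream(pathA));
             InputStream b = new BufferedInputStream(new FileInputStream(pathB))) {
            int byteA;
            int byteB;
            do {
                byteA = a.read();
                byteB = b.read();
                if (byteA != byteB) {
                    return false;
                }
            } while (byteA != -1); // both streams hit EOF together
            return true;
        }
    }
}
```

In practice you would combine the two approaches: hash every incoming file, and run this full comparison only when a new file's hash matches one already stored. That keeps the common case fast while removing the collision risk entirely (comparing file lengths first is an easy extra shortcut).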