
I have sorted lists of filenames concatenated into strings, and I want to identify each such string by a unique checksum.

The size of these strings is a minimum of 100 bytes, a maximum of 4000 bytes, and an average of 1000 bytes. The total number of strings could be anything, but will more likely be around 10,000.

Is CRC-32 suited for this purpose?

E.g. I need each of the following strings to have a different fixed-length (preferably short) checksum:

"/some/path/to/something/some/other/path"
"/some/path/to/something/another/path"
"/some/path"
...
# these strings can get __very__ long (very long strings are the norm)

Does the uniqueness of CRC-32 hashes increase with input length?

Is there a better choice of checksum for this purpose?

MCH
  • If you already have a unique checksum then what is the issue? – Ryan Vincent Apr 14 '16 at 19:23
  • Min 100 bytes, max 4000 bytes, average 1000 bytes. The issue is to shorten these strings: recalculate them, recalculate the checksum, then see if the checksum has been calculated before. I want to be sure that CRC-32 is suited for this, because I don't know much about hash functions, and I want to minimize collision probability. – MCH Apr 14 '16 at 19:23
  • How many entries are you expecting in total? Guess? 1,000, 10,000 (1e4), one million (1e6), more? – Ryan Vincent Apr 14 '16 at 19:28
  • Total entries could be any number, but will more likely be in the range of 10,000. – MCH Apr 14 '16 at 19:29
  • Thank you! (I really do not know much about cryptography.) (If you write an answer I'll accept it.) – MCH Apr 14 '16 at 19:34
  • Would you mind adding the values in the comments to your question? IMO, it will help others to answer your question, as they will know the 'size of the problem'. Thanks for the information – it helps us. – Ryan Vincent Apr 14 '16 at 19:35
  • IMO, I would look at an 'md5' hash for your application. Fast and unlikely to give collisions. It is not to be used for _anything to do with security_; IMO, it would be a quick lookup of the filename. – Ryan Vincent Apr 14 '16 at 19:42

1 Answer

No.

Unless your filenames were all four bytes or fewer, there is no assurance that the CRCs will be unique. With 10,000 names, the probability of at least two of them having the same CRC is about 1%.

This would be true for any 32-bit hash value.
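
To put numbers on it, here is a quick birthday-bound estimate, as a sketch in Python, using the standard approximation 1 - exp(-n(n-1)/(2m)) for m possible hash values:

    import math

    def collision_probability(n, bits):
        """Approximate chance that at least two of n uniformly random
        hash values of the given bit width collide (birthday bound)."""
        m = 2.0 ** bits
        return 1.0 - math.exp(-n * (n - 1) / (2.0 * m))

    print(collision_probability(10000, 32))  # ~0.0116, i.e. roughly 1%
    print(collision_probability(10000, 64))  # ~2.7e-12, vanishingly small

As the second line shows, moving to a 64-bit hash makes an accidental collision among 10,000 names astronomically unlikely, but still not impossible.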

The best way to assign a unique code to each name is to simply start a counter at zero for the first name and increment it for each subsequent name, assigning the counter value as the code for that name. However, that won't help you compute the code given just the name.
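
A minimal sketch of that counter approach (the names here are hypothetical; note that it only works while you keep the whole name-to-code table around, which is exactly the limitation above):

    # Sketch: sequential codes. The dict is the price you pay; without it
    # there is no way to recover a code from the name alone.
    codes = {}

    def code_for(name):
        # Returns the existing code for a known name, or assigns the next one.
        return codes.setdefault(name, len(codes))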

You can use a hash, such as a CRC or some other hash, but you will need to deal with collisions. There are several common approaches in the literature. You would keep a table of hashes with names assigned; on a collision, you simply increment the hash until you find a slot that is not yet used, and assign that one. Then, when looking up a name, you start at the computed hash and do a linear search for the name until you find it or reach an unused slot (see the sketch below).
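
As an illustration, here is a minimal open-addressing sketch along those lines, in Python, using the built-in hash for brevity; in practice you would substitute your CRC or other hash, and the table must have more slots than names:

    TABLE_SIZE = 1 << 15            # must exceed the expected ~10,000 names
    table = [None] * TABLE_SIZE     # each slot holds a name; None = unused

    def assign_code(name):
        """Return a code for name, linearly probing past collisions."""
        i = hash(name) % TABLE_SIZE
        while table[i] is not None and table[i] != name:
            i = (i + 1) % TABLE_SIZE   # occupied by another name: try next slot
        table[i] = name                # claim (or re-find) the slot
        return i                       # the slot index is the unique code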

As for the hash, I would recommend XXH64. It is a very fast 64-bit hash. You do not need a cryptographic hash for this application; one would be unnecessarily slow.
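
For example, with the third-party xxhash bindings for Python (assuming you can install a package: pip install xxhash):

    import xxhash  # third-party package providing XXH64

    digest = xxhash.xxh64(b"/some/path/to/something/some/other/path")
    print(digest.intdigest())  # 64-bit integer digest
    print(digest.hexdigest())  # same digest as 16 hex characters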

Mark Adler
  • Thank you! Sadly, I cannot just use a counter in this case. So ... to minimize collisions and to maximize speed, which hash function would you recommend? – MCH Apr 14 '16 at 19:52
  • You can minimize collisions all you like with longer hashes, but the probability of a collision will _never_ be zero. Unless you are happy with a program that will only probably work, you will need to deal with collisions. – Mark Adler Apr 14 '16 at 19:54
  • @MCH Git has been working pretty well with 160-bit SHA-1, but it is still [vulnerable to collisions](http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob). Not that you're ever likely to come across one by accident. – Antti Haapala -- Слава Україні Apr 14 '16 at 19:59
  • This really makes me think. I mean, hashes are used to verify the integrity of arbitrary-length files ... Of course I do not want a program that just 'probably' works, but I do not see a way around a hash function here :( @Antti Haapala: Thank you! – MCH Apr 14 '16 at 20:02
  • @Mark Adler: Thank you for explaining how to handle the 'same hashes for different input' problem! – MCH Apr 14 '16 at 20:55