4

I am comparing personal info of individuals, specifically their name, birthdate, gender, and race, by hashing a string containing all of this info and comparing the hash objects' hexdigests. This produces a 32-digit hexadecimal number, which I am using as a primary key in a database. For example, using my identifying string would work like this:

>>> import hashlib
>>> id_string = "BrianPeterson08041993MW"
>>> byte_string = id_string.encode('utf-8')
>>> hash_id = hashlib.md5(byte_string).hexdigest()
>>> print(hash_id)
3b807ad8a8b3a3569f098a575091bc79

At this point, I am trying to ascertain collision risk. My understanding is that MD5 doesn't have significant collision risk, at least for strings that are relatively small, which mine are (about 20-40 characters in length). However, I am not using the 128-bit digest object, but the 32-character hexdigest.

Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?

Brian Peterson
  • 32 characters · 4 bits/character = 128 bits – Gumbo Aug 15 '13 at 20:09
  • Ah. So the hexdigest is the correct way to represent a hash, and doesn't cause an increased risk of non-uniqueness? – Brian Peterson Aug 15 '13 at 20:12
  • [`hexdigest`](http://docs.python.org/2/library/hashlib.html#hashlib.hash.hexdigest) returns just a hexadecimal representation of the binary [`digest`](http://docs.python.org/2/library/hashlib.html#hashlib.hash.digest). – Gumbo Aug 15 '13 at 20:12
  • Right. I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10. – Brian Peterson Aug 15 '13 at 20:21
  • MD5 processes an arbitrary-length message into a fixed-length output of 128 bits, typically represented as a sequence of 32 hexadecimal digits. http://stackoverflow.com/questions/3394503/maximum-length-for-md5-input-output – berkay Aug 15 '13 at 20:23
  • I read through that. So a final question: if, over a period of time, no collisions arise from a hashing function, can I assume that all my hexadecimal ids are unique? – Brian Peterson Aug 15 '13 at 20:32
  • I'm actually working on an open source project--you can find it [here](github.com/sc3/cookcountyjail)--which provides an API for data on the Cook County jail population. We decided not to include personally identifying information in the database at the outset, including name and birthdate (the latter of which can hugely narrow down who someone is). However, I am expanding our database schema-- until now we used a jail id assigned by the system as the primary key for the inmate object. Now, I want to track individuals themselves--still anonymously--so we can see the phenomena of recidivism. – Brian Peterson Aug 17 '13 at 00:45
  • Annoying SO comments. You can find it [here](http://github.com/sc3/cookcountyjail). – Brian Peterson Aug 17 '13 at 00:52
  • This is the quintessential use for hashing, is it not? We don't need (or want) to retrieve the original data used to get the hash, but we do want to be able to compare a unique, harmless object representation. – Brian Peterson Aug 17 '13 at 00:56

2 Answers

5

Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?

[...]

I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10

Old question, but yes, you were a bit off base, so to speak.

It's the number of random bits that matters, not the length of the representation.

The digest is just a number, an integer, which can be converted to a string using different numbers of distinct digits. For example, here is the same 128-bit number shown in a few different radices:

"340106575100070649932820283680426757569" (base 10)
"ffde24cb47ecbff8d6e461a67c930dc1" (base 16, hexadecimal)
"7vroicmhvcnvsddp31kpu963e1" (base 32)

Shorter is nicer and more convenient (in auth tokens, etc.), but each representation carries exactly the same information and has the same chance of collision. Shorter representations are shorter for the same reason that "55" is shorter than "110111", while still encoding the same thing.
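To see that these really are the same value, here is a minimal Python 3 sketch reusing the identifying string from the question (the base-32 line uses the RFC 4648 alphabet, so its digits differ from the base-32 example above):

import base64
import hashlib

digest_bytes = hashlib.md5("BrianPeterson08041993MW".encode("utf-8")).digest()
as_int = int.from_bytes(digest_bytes, "big")      # one 128-bit integer

print(as_int)                                     # base 10
print(format(as_int, "032x"))                     # base 16, same as hexdigest()
print(base64.b32encode(digest_bytes).decode())    # base 32 (different alphabet, padded)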

This answer might also clarify things, as well as toying with code like:

new BigInteger("340106575100070649932820283680426757569").toString(2)

...or something equivalent in other languages (Java/Scala above).
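For instance, a rough Python equivalent of that line, using the same decimal string:

format(int("340106575100070649932820283680426757569"), "b")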

On a more practical level,

[...] which I am using as a primary key in a database

I don't see why you wouldn't just do away with any chance of collision by using a normal autoincremented id column (BIGINT AUTO_INCREMENT in MySQL, BIGSERIAL in PostgreSQL).
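A minimal sketch of that approach (SQLite here only to keep the example self-contained; the table and column names are made up), keeping the hash as a unique lookup column while the database assigns the primary key:

import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned by the database
        id_hash TEXT UNIQUE NOT NULL                -- MD5 hexdigest, used only for lookups
    )
""")

hash_id = hashlib.md5("BrianPeterson08041993MW".encode("utf-8")).hexdigest()
conn.execute("INSERT INTO person (id_hash) VALUES (?)", (hash_id,))
print(conn.execute("SELECT id, id_hash FROM person").fetchall())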

Jonik
  • This is a great answer - thank you! Regarding the question of why not just use a normal autoincremented ID, if you use a hash, you can find the related record in a database simply by knowing the underlying data. However, if you use an autoincrement key (which I do use for some purposes) you will need that key to find the record (i.e. you can't go right to the record unless you already know the key). The hash also provides speed benefits when you want to quickly search for a long combination of fields without having to enumerate each field in your query. – Ben Feb 15 '18 at 14:19
2

An abbreviated 32-bit hexdigest (8 hex characters) would not be long enough to effectively guarantee a collision-free database of users.

The formula for the birthday collision probability is here:

What is the probability of md5 collision if I pass in 2^32 sets of string?

Using a 32-bit key would mean that your software would start to break at around 10,000 users. The collision probability would be about 1%. It gets a lot worse very fast after that. At 100,000 users, the collision probability is 69%.
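As a rough sanity check of the figures in this answer, here is a small Python sketch of the standard birthday-bound approximation, p ≈ 1 − e^(−n(n−1)/2d), where n is the number of users and d = 2^bits is the number of possible keys:

import math

def collision_probability(n_users, key_bits):
    # Birthday bound: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps tiny results accurate.
    d = 2 ** key_bits
    return -math.expm1(-n_users * (n_users - 1) / (2 * d))

print(collision_probability(10_000, 32))    # ~0.012   (about 1%)
print(collision_probability(100_000, 32))   # ~0.69    (about 69%)
print(collision_probability(10**9, 64))     # ~0.027   (about 2.7%)
print(collision_probability(10**11, 96))    # ~6.3e-08
print(collision_probability(10**11, 128))   # ~1.5e-17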

A 64-bit key hits another breaking point at about 1 billion users, with a collision probability of roughly 2.7%; at 10 billion users it climbs above 90%.

For 100 billion users (a generous upper bound of the earth's population for the foreseeable future), a 96-bit key is a little risky in my opinion: the collision chance is about one in 16 million. Really, you need a 128-bit key, which gives you a collision probability of about 1×10^-17.

128-bit keys are 128/4 = 32 hex characters long. If you wanted a shorter key for aesthetic purposes, you would need 22 alphanumeric characters (62 symbols, roughly 5.95 bits each) to exceed 128 bits. Or if you use printable ASCII characters (codes 32-126, 95 symbols), you could get away with 20 characters.
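A quick way to check those character counts (a sketch; the alphabet sizes are the usual ones):

import math

def chars_needed(alphabet_size, bits=128):
    # Each character carries log2(alphabet_size) bits of information.
    return math.ceil(bits / math.log2(alphabet_size))

print(chars_needed(16))   # hex digits               -> 32
print(chars_needed(62))   # alphanumeric [A-Za-z0-9] -> 22
print(chars_needed(95))   # printable ASCII 32-126   -> 20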

So when you're talking about users, you need at least 128 bits for an effectively collision-free random key: a 20-32 character string depending on the alphabet, or a 128/8 = 16 byte binary representation.

David Beckwith