2

I have to save the combination of lastname, firstname and birth-date of a person as a hash. This hash is later used to search for the same person with the exactly same properties. My question is, if SHA-1 is a meaningfull algorithm for this.

As far as I understand SHA-1, there is virtually no possibility that two different persons (with different attributes) will ever get the same hash-value. Is this right?

HCL
  • 36,053
  • 27
  • 163
  • 213

3 Answers3

1

If you want to search for a person knowing only those credentials, you could store the SHA-1 in the database(or MD5 for speed, unless you have like a quadrillion people to sample).

The hash will be worthless, as it stores no information about the person, but it can work for searching a database. You just want to make sure that the three pieces of information match, so it would be safe to just concatenate them:

user.hash = SHA1(user.firstName + user.DOB + user.lastName)

And when you query, you could check if the two match:

hash = SHA1(query.firstName + query.DOB + query.lastName)

for user in database:
  if user.hash == hash:
    return user

I put query.DOB in the middle because the first and last name might collide, like if JohnDoe Bob was born on the same day as John DoeBob. I'm not aware of numeric names, so I think this will stop collisions like those ;)

But if this is a big database, I'd try MD5. It's faster, but there is a chance of a collision (in your case, I can guarantee that one won't occur). The chance of a collision, however, is really small.

To put that into perspective, a collision is a 1 / 2^128 occurrence, which is:

                          1
---------------------------------------------------
340,282,366,920,938,463,463,374,607,431,768,211,456

And that's a little smaller than:

0.0000000000000000000000000000000000000293873 %

I'm pretty sure you're not going to get a collision ;)

Blender
  • 289,723
  • 53
  • 439
  • 496
  • +1 Good idea with placing DOB in the middle. It's a big database. Speed (hashing) is not a problem. The only thing I fear is a collision. Better to use SHA-256? – HCL Apr 13 '11 at 20:54
  • By big, I mean bigger than `2^128`. That's around `10^38`, which is `10^31` times larger than the size of the human population. I doubt you'll get one ;) – Blender Apr 13 '11 at 20:58
  • Better go with SHA-65536 just to be safe – Scott Driggers Jun 01 '23 at 21:09
1

Hash collisions are inevitable. However small can be the chance of the collision, you shouldn't really rely only on hash if you really want 100% identification.

If you use hashing to speed up database search, there is no need to use SHA256. Use whatever hash function your system has with the smallest size (MD5() for MySQL or you might even try CRC32, if your database is not-so-big). Just when you query table, you need to provide all conditions you are searching by:

SELECT * from user WHERE hash="AABBCCDD" AND firstname="Pavel" AND surname="Sokolov"

Databases maintain a value, that is called index cardinality. It's a measure of uniqueness of the data on the given index. So, you can index fields you want together with hash field and database will choose the most selective index for the query himself. Adding additional conditions doesn't affect performance negatively because most database can use only one index when selecting data from the table and they will select the one with the most cardinality value.

The database will need to first select all rows matches the index and then scan through them to discard rows that doesn't match other conditions.

If you cannot use the method I described, well, I think even MD5 collision probability is very low to occur on database of people names.

P.S. I hope you know, that you know that "the combination of lastname, firstname and birth-date of a person" is not enough to 100% identify a human? And sooner this combination will match than some hashes collide.

Vladislav Rastrusny
  • 29,378
  • 23
  • 95
  • 156
  • +1 It's not relevant/applicable for this specific task but thanks for the info of the index-cardinality. Good to know, I was not aware of that. – HCL Apr 13 '11 at 21:15
0

If you are concerned with collisions there is a good discussion here:

Understanding sha-1 collision weakness

If you have security concerns, I would consider SHA-256 instead.

Community
  • 1
  • 1
fredw
  • 1,409
  • 12
  • 23
  • Thanks for your answer. However I don't have to fear collision forced by an attacker. My question was only about the probability of a "natural" collision. Thanks anyway. – HCL Apr 13 '11 at 21:20
  • Initiating a SHA-1 collision attack by engineering two people with specified names and birth dates so that their hashes collides would be a **very impressive** feat. – Joachim Sauer Apr 14 '11 at 09:35