27

I have some code on my PHP powered site that creates a random hash (using sha1()) and I use it to match records in the database.

What are the chances of a collision? Should I generate the hash, then check first if it's in the database (I'd rather avoid an extra query) or automatically insert it, based on the probability that it probably won't collide with another.

vbence
  • 20,084
  • 9
  • 69
  • 118
alex
  • 479,566
  • 201
  • 878
  • 984
  • 11
    Ask the question what will it cost you if there is a collision. If this is a free site fine. If you are running a money making business and an overrite will cost you a million dollar contract then I would think again. – Martin York Nov 18 '08 at 08:14
  • If you have to obfuscate some data in your url to hide data, you are doing something wrong. – Arkh Nov 18 '09 at 15:00
  • Why? imagine a scenario where you are selling digital goods and these goods can be accessed from an API. some are priced and some arent. This is the best way to refer to them via the url without the user changing the urls and getting other "non-authorized" applications to download. – Faisal Abid Nov 19 '09 at 21:04
  • Or you could implement access levels and check if people have access to your data before sending it to them blindly. Yeah, you need to put some work to do it, but you're paid for that, not to implement security by obscurity which has already failed enough. Never trust data coming from the user. – Arkh Nov 20 '09 at 09:00
  • I'm inclined to agree with this. Though there are cases where hashing data and being concerned about its uniqueness is important (Mercurial IDs come to mind), if you _have to_ hide your id's for security reasons, that's a very dangerous security model. And if you don't need to do so, why bother? – dimo414 Jun 21 '10 at 18:33
  • There is one obvious counter example to this: Password reset URLs. It is generally accepted as secure when you have elements working together. Something the user is given - the reset url; something they know or have - control over their email address and/or the answer to a secret question; something they must do - respond to the reset email before it expires. – Patrick M Aug 05 '12 at 20:23

11 Answers11

28

If you assume that SHA-1 does a good job, you can conclude that there's a 1 in 2^160 chance that two given messages have the same hash (since SHA-1 produces a 160-bit hash).

2^160 is a ridiculously large number. It's roughly 10^48. Even if you have a million entries in your database, that's still a 1 in 10^42 chance that a new entry will share the same hash.

SHA-1 has proved to be fairly good, so I don't think you need to worry about collisions at all.

As a side note, use PHP's raw_output feature when you use SHA-1 as this will lead to a shorter string and hence will make your database operations a bit faster.

EDIT: To address the birthday paradox, a database with 10^18 (a million million million) entries has a chance of about 1 in 0.0000000000003 of a collision. Really not worth worrying about.

Artelius
  • 48,337
  • 13
  • 89
  • 105
  • 18
    For all those that really believe in collision freedom, please remember the Birthday effect. Your first collision is randomly more likely to occur than you might imagine. So be careful anyways – Robert Gould Nov 18 '08 at 06:25
  • 1
    Yeah but one collision won't be the thing killing your system. Your own bug will. I don't think something that happen randomly once in a decade, except in a nuclear factory, would be something we should worry about. If I could have only that kind of annoyance ... ;-) – Bite code Nov 18 '08 at 06:38
  • 5
    The first collision has a 50% chance of happening after the first 2^80 hashes. – Seun Osewa Nov 24 '08 at 15:59
  • 6
    @Seun: No, that's completely wrong. Read about the birthday paradox. – Artelius Nov 19 '09 at 20:45
  • your system would fail if you had to create a DB record for every atom in the observable universe, which has been estimated to be 10^80! You're better off using a SHA1 hash concatenated with the record ID number. – Michael Butler Aug 13 '12 at 19:21
  • 2
    @Artelius Is "1 in 0.0000000000003" meant to be "1 in 3.333 Billion"? Or perhaps "A 0.0000000000003% chance"? Please correct me if I'm wrong. – Addison Aug 05 '16 at 09:56
  • The birthday paradox say exactly that @Artelius, I think that you have miss that it is a square root (sqrt(2^160) = 2^80). Notice that sqr(365) is around 20. – MrIo Dec 08 '20 at 19:52
15

Use a symmetric encryption scheme and a private server key to encrypt the ID (and other values) when you send them to the client and decrypt again on reception. Take care that your cryptographic function provides both confidentiality and integrity checks.

This allows you to use sensible values when talking to the DB without any collision, great security when talking to the client and reduces your probability to land on thedailyWTF by approximatly 2^160.

See also Pounding A Nail: Old Shoe or Glass Bottle?!

David Schmitt
  • 58,259
  • 26
  • 121
  • 165
14

why not do something which guarantees there'll be no collisions, as well as makes sure that no one can change a GET parameter to view something they shouldn't: using a salt, combine the id and its hash.

$salt = "salty";
$key = sha1($salt . $id) . "-" . $id;
// 0c9ab85f8f9670a5ef2ac76beae296f47427a60a-5

even if you do accidentally stumble upon two numbers which have the exact same sha1 hash (with your salt), then the $key will still be different and you'll avoid all collisions.

nickf
  • 537,072
  • 198
  • 649
  • 721
  • 1
    preferably use HMAC (hash_hmac in PHP) which, we're told, addresses some weaknesses with this kind of simple scheme. http://en.wikipedia.org/wiki/HMAC – araqnid Nov 18 '09 at 14:16
5

If you use numerically increasing IDs as input, then chances are practically zero that SHA-1 will collide.

If the ID is the only input, then SHA-1 seems to be quite some overkill - producing a 160 bit hash from a 32-bit integer. I would rather use modular exponentiation, e.g. chose a large (32-bit) prime p, compute the modular generator g of that group, and then use g^id. This will be guaranteed collision free, and only give 32-bit "hashes".

Martin v. Löwis
  • 124,830
  • 17
  • 198
  • 235
  • The id is not the only input. There is some specific data and time() rand() to mix things up a bit. – alex Nov 18 '08 at 06:21
  • 2
    Then just generating 160 random bits will be sufficiently unique - no need to generate a hash of anything (it won't become more unique through the hash, nor will it become more random). – Martin v. Löwis Nov 18 '08 at 13:36
4

SHA-1 produces 160 bit long digest. Therefore you are safe as long as you have less than 2^(160/2) entries. Division by 2 is due to birthday paradox.

Szere Dyeri
  • 14,916
  • 11
  • 39
  • 42
  • 8
    "Safe" is surely a relative term. It's not that it's "safe" up to some point and "unsafe" afterwards. It only makes sense to talk about probabilities of collisions at certain points. The OP may need a "one in a million chance or better" or he may need "one in a billion". – Jon Skeet Nov 18 '08 at 06:28
  • @Szere Dyeri Remember randomness is not predictable :) – Robert Gould Nov 18 '08 at 06:33
  • 1
    Jon, you are right. To be more precise, expected number of N-bit hashes that can be generated before a collision is 2^(N/2), where expectation is a formal first order statistics of the distribution. – Szere Dyeri Nov 18 '08 at 09:28
4

From first principles:

SHA-1 produces a 160-bit digest. Assuming it uses the entire bit-space evenly (which is presumably what it was designed to do), that is only a 2^-160 chance on each insert that you would get a collision.

So for each insert, it should be safe to assume there is no collision, and deal with the error if there is.

That is not to say that you can ignore the chance of collision entirely.

The Birthday Paradox suggests the chance of there being at least one collision in your database is higher than you would guess, because of the O(N^2) possible collisions.

Oddthinking
  • 24,359
  • 19
  • 83
  • 121
  • The Birthday Paradox brings the chance of a collision up to 0.00000000000000000017347234759768070944119244813919%. Really not worth worrying about. – Jeff Hubbard Nov 18 '08 at 07:08
  • 3
    Jeff, I concede that the collision risk can be ignored in almost all cases. I didn't do the maths before. However, you fail to mention how many objects there are in the collection, so your estimation of the chance of collision is kind of meaningless. – Oddthinking Nov 18 '08 at 08:08
1

Ask the question what will it cost you if there is a collision. If this is a free site fine. If you are running a money making business and an overrite will cost you a million dollar contract then I would think again.

I think you are going about this the wrong way.
I think you need to keep the unique ID but you want to make sure that the users can not manually change the ID.

One way to to do this is to put the ID and the hash of the ID (with some extra data) in the link.

For Example: (my PHP is rusty so general algorithm would be:)

id   = 5;
hash = hash("My Private String " + id)
link = "http://mySite.com/resource?id=" + id + "&hash=" + hash

Then when you receive a request just validate that you can regenerate the hash from ID. This does leave you open to an attack to work out "My Private String" but that will be quite computationally hard and you could always append something else unique that is not directly available to the user (like the session ID).

Martin York
  • 257,169
  • 86
  • 333
  • 562
1

There is a very simple rule to find out if any hashing algorithm would have collisions or not. If the output range of an algorithm is a finite number, one is bound to have a collision, sooner or later.

Even though SHA1 has a very large range of 2^160 hash possibilities, its still a finite number. However, the inputs which can be passed on that function are literally infinite. Given a large enough input data set, collisions are bound to happen.

Ketan Patil
  • 1,222
  • 13
  • 21
0

The other comments have covered you on the probabilities, however if you look at this pragmatically then you can get a definite answer for yourself.

You said yourself that you are going to be hashing your sequential IDs. It would be easy to code up a test case. Iterate through ~100,000,000 ids and check for collisions. That would not take long to do. On the other hand you might run out of memory quarter of the way through.

Josh
  • 1,479
  • 10
  • 12
0

I don't think sha1() is going to give you any trouble here, weak random number generation is a more likely candidate for collisions.

Stefan Esser wrote good article on the topic.

Waquo
  • 840
  • 1
  • 5
  • 6
0

What are the chances of a collision?

The exact probability that n hashes collide with S the total number of different possible hashes is:

(perfect behaviour of hash function, birthday paradox, blah blah blah...)

You wont be able to compute this directly, since those are huge numbers, so we use limits and make 2 assumptions:

With those 2 assumptions, the probability of a collision can be computed with:

Now you can compute the probability of a collision for some number n of records. This is very very accurate for anything less than 2^70 records for sha1 (S=2^160), the worse the approximation the more n approach 2^80.

Example

For example if you want to save a huge number of users, specifically the same number as person are in the world (~ 8 billions), and you are using sha1 (S=2^160), the probability of a collision is 2.5e-29 (notice that the 2 assumptions hold). To give you a reference, the probability of wining the Euromillion Jackpot is 7e-9 approx.

Curiosity: What to do with larger (larger?!) numbers?

Compute directly the limit without the second assumption.

For example, the first collision is expected around the square root of S, (in the case of sha1 n=2^80). At that value, the second condition doesn't hold but we can compute the limit directly with:

which is a 40% approx. of probability of a collision.

MrIo
  • 108
  • 1
  • 10