4

I am trying to create a unique record id using the following function:

import hashlib
from base64 import b64encode

def make_uid(salt, pepper, key):
  s = b64encode(salt)
  p = b64encode(pepper)
  k = b64encode(key)
  return hashlib.sha256(s + p + k).hexdigest()

Where pepper is set like this:

uuid_pepper = uuid.uuid4()
pepper = str(uuid_pepper).encode('ascii')

And salt and key are the same values for every request.

My question is, because of the unique nature of the pepper, will make_uid in this intance always return a unique value, or is there a chance that it can create a duplicate?

The suggested answer is different because I'm not asking about the uniqueness of various uuid types, I'm wondering whether it's at all possible for a sha256 hash to create a collision between two distinct inputs.

Community
  • 1
  • 1
mwkrimson
  • 115
  • 1
  • 9
  • Possible duplicate of [When should I use uuid.uuid1() vs. uuid.uuid4() in python?](http://stackoverflow.com/questions/1785503/when-should-i-use-uuid-uuid1-vs-uuid-uuid4-in-python) – m0nhawk Apr 22 '17 at 17:42
  • @m0nhawk- I am not asking whether I should be using uuid1 or uuid4, I am asking if my hexdigest will always be unique in this instance. – mwkrimson Apr 22 '17 at 17:43
  • If you go on the link and read, then you'll found that the accepted answer **also** answers on your question. – m0nhawk Apr 22 '17 at 17:45
  • @m0nhawk I'm not really sure how-- I'm not using a uuid as my record id, but as part of the hash creation. I want to know if the resulting hash will always be unique, not whether uuid4 will be unique. – mwkrimson Apr 22 '17 at 17:46
  • http://stackoverflow.com/questions/2444321/how-are-hash-functions-like-md5-unique – Ignacio Vazquez-Abrams Apr 22 '17 at 17:47
  • @mwkrimson If sometime you'll get the same `uuid_pepper` you will have the same SHA256 (you already said the `key` and `salt` are the same). So, all the uniqueness is on the `uuid4()`. – m0nhawk Apr 22 '17 at 17:49
  • @m0nhawk- I understand the perspective, but I was under the impression that not all hashing algorithms are equal-- that some can return a collision on two distinct inputs (like md5), therefore what my question is, basically, will `hashlib.sha256(input).hexdigest()` always be unique between distinct inputs. – mwkrimson Apr 22 '17 at 17:52
  • @KennyOstrom due to the nature of the project I'm working on, I need record ids to have built-in uniqueness to the user, so I am passing a user secret key in conjuction with a user supplied salt; then they can send these back with the pepper i send on record creation to retrieve the recordid for lookups – mwkrimson Apr 22 '17 at 17:54
  • So this boils down to asking which has a higher collision chance, sha256 or uuid4? They both turn out to be pretty negligible, unless you generate a really large quantity of such records, or have a bad rng init. – Kenny Ostrom Apr 22 '17 at 18:02
  • @KennyOstrom In a way that's the question I'm asking; primarily whether it's possible at all (however unlikely) to have a collision when hashing distinct inputs using sha256 – mwkrimson Apr 22 '17 at 18:05
  • 1
    @mwkrimson Nice [answer](https://crypto.stackexchange.com/questions/3049/are-there-any-known-collisions-for-the-sha-1-2-family-of-hash-functions) also. – m0nhawk Apr 22 '17 at 18:06

1 Answers1

17

I think what you want to know is whether SHA256 is guaranteed to generate a unique hash result. The answer is yes and no. I got the following result from my research, not 100% accurate but close.

In theory, SHA256 will collide. It has 2^256 results. So if we hash 2^256 + 1 times, there must be a collision. Even worse, according to statistics, the possibility of collision within 2^130 times of hashing is 99%.

But you probably won't generate one during your lifetime. Assume we have a computer that can calculate 10,000 hashes per second. It costs this computer 4 * 10^27 years to finish 2^130 hashes. You might not have any idea about how large this number is. The number of years of doing hashing is 2 * 10^22 times of that of human exist on earth. That means that even if you started doing hashing since the first day we were on earth till now, the possibility of collision is still very very small.

Hope that answers your question.

Junbang Huang
  • 1,927
  • 19
  • 26
  • Yes, this is the answer I was looking for-- that theoretically I can have a collision with distinct inputs. It seems ridiculously unlikely, but still possible. Thanks! – mwkrimson Apr 22 '17 at 18:03
  • 1
    I think it's more practical to calculate what number of years will be required to have e.g. 0.0000000001% chance (or something like this), than calculating number of years to have a 99% chance. If the chance is 99% already, then the collision did already happen many times until this moment. So, basically what everybody wants to know is how many years will pass until **first** collision, not until it's already a 99% chance. If even in 100 years chance of collision is still under some threshold value (e.g. 10^(-20) %), then sha256 can be used to generate ID based on data, else it's not. – Ruslan Stelmachenko Nov 22 '19 at 15:55