5

I need to produce unique identifiers that can be used in filenames and can be reproduced given the same input values. I need to produce millions of these identifiers as the source input has millions of combinations.

For simplicity's sake, I will use a small set in the example, but the actual sets can be rather large (hundreds, maybe thousands, of items); larger than could be manually encoded into a filename.

I noticed that the 5th method of generating UUID's allows you to provide a string input.

> input_set = {'apple', 'banana', 'orange'}
> uuid.uuid5(uuid.NAMESPACE_URL, pickle.dumps(input_set)).hex
'f39926529ad45997984643816c1bc403'

The documentation says it uses SHA1 under the hood. Is the risk of a collision too high? Is there a better way of reliably hashing unique identifiers?

Asclepius
  • 57,944
  • 17
  • 167
  • 143
Brendan Abel
  • 35,343
  • 14
  • 88
  • 118
  • 2
    Here's one resource that addresses the question of UUID collisions: https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions –  Dec 01 '17 at 21:17
  • Thanks @blurp though that only deals with version 1 and 2 of generating UUID's. I'm looking to use version 5 which generates them from a input string and namespace identifier. – Brendan Abel Dec 01 '17 at 21:23
  • 1
    The Wikipedia article talks specifically about the chances of a collision with versions 3, 4, and 5. –  Dec 01 '17 at 21:25
  • Thanks @blurp, it seems like the probability would be very small; though, I'm going to leave the question up, because I'm still interested in seeing if there is a better, more canonical way to do this. – Brendan Abel Dec 01 '17 at 21:31
  • What are some examples of your input values? Is there any reason why you cannot use the input values themselves if uniqueness is solely determined by them? – Blender Dec 01 '17 at 21:42
  • @Blender The input values will be sets of strings. They would be too long to include in a filename. – Brendan Abel Dec 01 '17 at 21:47
  • 1
    @BrendanAbel: What kinds of strings? Are they all known in advance? All hash functions are guaranteed to have collisions. While the chances of a collision are extremely low, there might be ways to guarantee that you will never have collisions if you can make assumptions about your input data. – Blender Dec 01 '17 at 21:55
  • @Blender Yes, for this purpose, I will know all the possible strings in advance. – Brendan Abel Dec 01 '17 at 21:57
  • @BrendanAbel: How large can these sets be and how many possible choices are there for each element in the set? – Blender Dec 01 '17 at 22:00
  • @Blender On the order of 100s or 1000s, for both – Brendan Abel Dec 01 '17 at 22:07
  • You're constrained to having ~240-character filenames consisting of about ~40 valid characters, so you have 40^240 possible filenames to work with. This means that you can guarantee uniqueness only if your total number of strings is below 1277. – Blender Dec 01 '17 at 22:26
  • @Blender Considering UUID's are only 128 bits (32 chars), I'd hit a collision long before I got to even a portion of the 2**1277 possible combinations. Luckily, only a small subset of the total possible combinations are valid, probably around a million or so. – Brendan Abel Dec 02 '17 at 00:23
  • 1
    @BrendanAbel: If you can enumerate all the valid combinations and bijectively map sets of strings to their corresponding index, you would only need 20 bits (or 4 alphanumeric characters). If you have only a million valid inputs, it's probably much easier to just use a cryptographic hash function and manually verify that there are no collisions. Since it takes thousands of years of processor time to *deliberately* find one, I doubt you'll *accidentally* find one. – Blender Dec 02 '17 at 01:16
  • what do you think of this? https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python – Charlie Parker Jul 25 '22 at 19:06

3 Answers3

6

The odds that you'd get an SHA1 collision from strings is astoundingly low. Currently there are less than 63 known collisions for SHA1.

First ever SHA1 collision found

First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time

SHA1 is no longer considered secure in the cryptography world, but certainly exceeds your expectations here.

Cryptographic hashing functions are designed to be one way functions.This means the functions inverse is "hard" to calculate. (i.e. knowing the output in no way helps you determine the input) As Blender pointed out in the comments this has nothing to do with the chance of collisions.

Take a look at the Birthday Paradox for some basic information on how the probability of a collision is calculated.

This question addresses the likely hood of a SHA1 collision. This article states

A cryptographic hash function has provable security against collision attacks if finding collisions is provably polynomial-time reducible from problem P which is supposed to be unsolvable in polynomial time. The function is then called provably secure, or just provable.

Here is a list of "secure" hash algorithms.

UPDATE You stated in the comments your input is much larger than the 160 bit limit for SHA1. I recommend you use SHA3 in this case as there is no limit on the size of your input. Check out the Python documentation for more information.

Here is a basic example:

import sha3
k = sha3.keccak_512()
k.update(b"data")
k.hexdigest()
'1065aceeded3a5e4412e2187e919bffeadf815f5bd73d37fe00d384fe29f55f08462fdabe1007b993ce5b8119630e7db93101d9425d6e352e22ffe3dcb56b825'
Brendan Abel
  • 35,343
  • 14
  • 88
  • 118
DoesData
  • 6,594
  • 3
  • 39
  • 62
  • Thanks. Though I want to point out that I'm not using filenames as the source input, I'm using very large strings; much larger than the 160 bit SHA1 or the 128 bit UUID strings. – Brendan Abel Dec 01 '17 at 21:51
  • 2
    Collisions are guaranteed to exist for any hash function, regardless of the existence of one-way functions. – Blender Dec 01 '17 at 21:51
  • If the names are longer than maybe consider using a different hash algorithm. Some can handle strings of arbitrary length. – DoesData Dec 01 '17 at 22:20
  • 1
    @BrendanAbel: SHA-1's *output* is 160 bits. Every hash function has a fixed-length output, that's why they're useful. They can take any amount of data and produce a fixed-length hash derived from it. – Blender Dec 02 '17 at 01:22
  • is md5 ok? e.g. https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python – Charlie Parker Jul 25 '22 at 19:05
  • @CharlieParker depends on your use case, but in general MD5 is not secure. Ref - https://security.stackexchange.com/questions/19906/is-md5-considered-insecure – DoesData Aug 10 '22 at 19:21
6

Instead of using pysha3 (see DoesData's answer), you could also use the built-in hashlib:

import hashlib

h = hashlib.sha3_512() # Python 3.6+
h.update(b"Hello World")
h.hexdigest()

Output:

'3d58a719c6866b0214f96b0a67b37e51a91e233ce0be126a08f35fdf4c043c6126f40139bfbc338d44eb2a03de9f7bb8eff0ac260b3629811e389a5fbee8a894'
Thomas
  • 8,357
  • 15
  • 45
  • 81
0

If the smaller base64.urlsafe_b64encode output would be preferable:

> import base64, hashlib

> base64.urlsafe_b64encode(hashlib.sha3_512('asdf'.encode()).digest())
b'jYjPWyD1Os164UebWzbcICF1OwSZAsdyR7snsTGzAL08qL7vKHVtzie4mQhnxFd6JTXn47dRQTmcoalMyEsOuQ=='

The above output is of length 88 whereas the corresponding hex would be of length 128.

Asclepius
  • 57,944
  • 17
  • 167
  • 143