13

I'm struggling a bit to generate an integer ID for a given string in Python.

I thought the built-in hash function would be perfect, but the IDs it produces are sometimes too long. That is a problem because I'm limited to a maximum of 64 bits.

My code so far: hash(s) % 10000000000. The input strings I can expect will be 12-512 characters long.

Requirements are:

  • integers only
  • generated from provided string
  • ideally up to 10-12 chars long (I'll have ~5 million items only)
  • low probability of collision..?

I would be glad if someone can provide any tips / solutions.

mlen108
  • curious, how does this compare to things suggested here: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python? – Charlie Parker Jul 25 '22 at 19:04
  • why use md5 vs sha3? https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash – Charlie Parker Jul 25 '22 at 19:05
  • fyi you might need this: https://stackoverflow.com/questions/7585307/how-to-correct-typeerror-unicode-objects-must-be-encoded-before-hashing – Charlie Parker Jul 25 '22 at 19:31

3 Answers

16

I would do something like this:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("some string".encode("utf-8"))  # hashlib needs bytes on Python 3
>>> str(int(m.hexdigest(), 16))[0:12]
'120665287271'

The idea:

  1. Calculate the hash of the string with MD5 (or SHA-1 or ...) in hexadecimal form (see the hashlib module)
  2. Convert the hexadecimal string to an integer, then convert that integer back to a base-10 string (the result contains only digits)
  3. Use the first 12 characters of that string.

If characters a-f are also okay, I would do m.hexdigest()[0:12].
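If the result should be an actual integer, the int/str round trip can be skipped by reading bytes of the raw digest directly. The following is only a minimal sketch of that idea: the function name `string_to_int_id`, the choice of UTF-8, the use of the first 8 bytes, and the final modulo are all assumptions made for illustration, not part of the answer above.

```python
import hashlib


def string_to_int_id(text: str, digits: int = 12) -> int:
    """Map a string to a non-negative integer with at most `digits` decimal digits.

    Hypothetical helper for illustration only.
    """
    # hashlib works on bytes, so encode the text first (UTF-8 is an assumption).
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # Interpret the first 8 bytes (64 bits) of the digest as an unsigned integer.
    value = int.from_bytes(digest[:8], byteorder="big")
    # Reduce the value to the requested number of decimal digits.
    return value % 10**digits


print(string_to_int_id("some string"))
```

Any other hashlib algorithm (for example `hashlib.sha256`) could be swapped in the same way. The result is stable across runs and machines because it depends only on the input bytes.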

Stephan Kulla
  • Thanks, it looks great! It does not return an integer, but that is just a matter of casting it back to int. It would be nice if we could do away with the int/str/int coercion dance. Any idea? :) – mlen108 Apr 09 '14 at 22:28
  • ``m.hexdigest()`` provides a string with 32 characters. So the maximum value is ``'f'*32``, which has 39 digits (=``len(str(int('f'*32,16)))``). So you can integer-divide by 10**27 at the end to keep at most 12 digits. With this solution collisions might be more probable... but I have not thought it through. – Stephan Kulla Apr 09 '14 at 22:52
  • ``m.hexdigest()`` provides ``m.digest_size * 2`` characters (this might change, depending on the hash function you want to use) – Stephan Kulla Apr 09 '14 at 22:53
  • Note: you can also use the byte string [digest()](https://docs.python.org/2/library/hashlib.html#hashlib.hash.digest), slice enough bytes from it and convert them to an integer (or rather: interpret the byte string as an integer) – Stephan Kulla Apr 11 '14 at 09:03
  • MD5 is a bit of a sledgehammer to hash a string. Nuclear weapons are a good way to dig a hole, too. – doug65536 Jul 27 '22 at 13:41
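
On the collision question raised in the comments: a rough estimate is possible with the birthday approximation, assuming the IDs behave like uniformly random values (which truncated decimal strings only approximately do). The sketch below uses a hypothetical helper `collision_probability` and the ~5 million items mentioned in the question.

```python
import math


def collision_probability(n_items: int, id_space: int) -> float:
    """Approximate chance that at least two of n_items uniformly random IDs collide.

    Hypothetical helper for illustration only.
    """
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2N)),
    # where N is the number of possible IDs.
    exponent = n_items * (n_items - 1) / (2 * id_space)
    return -math.expm1(-exponent)


n = 5_000_000  # item count taken from the question
for digits in (10, 12, 18):
    p = collision_probability(n, 10**digits)
    print(f"{digits} digits: p ~ {p:.3g}")
```

Under that uniformity assumption, 5 million items in a 12-digit space give about 12.5 expected colliding pairs, so at least one collision is close to certain. An 18-digit ID still fits in a signed 64-bit integer and brings the estimate down to roughly one in a hundred thousand.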
1

If you're not allowed to add an extra dependency, you can continue using the built-in hash function in the following way:

>>> my_string = "whatever"
>>> str(hash(my_string))[1:13]
'460440266319'

NB:

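  • I am ignoring the 1st character because it may be a negative sign (for a non-negative hash this drops the first digit instead).
  • hash may return different values for the same string across runs, because the hash seed changes every time you run your program unless PYTHONHASHSEED is set. You may want to set it to a fixed value. See the Python documentation for PYTHONHASHSEED.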
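
The effect of PYTHONHASHSEED can be checked by hashing the same string in fresh interpreter processes. This is only a sketch: the helper name `hash_in_fresh_interpreter` is made up for illustration, and it assumes `sys.executable` points at a working Python 3 interpreter.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new Python process with the given PYTHONHASHSEED.

    Hypothetical helper for illustration only.
    """
    # Copy the current environment and pin the hash seed for the child process.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


same_a = hash_in_fresh_interpreter("whatever", "1")
same_b = hash_in_fresh_interpreter("whatever", "1")
other = hash_in_fresh_interpreter("whatever", "2")
print("same seed, same hash:", same_a == same_b)
print("different seed, same hash:", same_a == other)
```

Even with a fixed seed, the value of hash is not guaranteed to be identical across Python versions or platforms, so IDs that must be stored long-term are safer when derived from hashlib.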
Aditya Shaw
1

Encoding the string to UTF-8 was needed for mine to work:

import hashlib


def unique_name_from_str(string: str, last_idx: int = 12) -> str:
    """
    Generates a unique id name
    refs:
    - md5: https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python
    - sha3: https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash
    (- guid/uuid: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python?noredirect=1&lq=1)
    """
    m = hashlib.md5()
    # hashlib requires bytes, so encode the string first
    m.update(string.encode('utf-8'))
    # convert the hex digest to a decimal string and keep the first last_idx digits
    unique_name: str = str(int(m.hexdigest(), 16))[0:last_idx]
    return unique_name

See my ultimate-utils Python library.

Charlie Parker