32

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.

I'm looking for an algorithm that could calculate hash of such a strings and this hash will be also a string, but will have a fixed length, like youtube video ids:

https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^

MD5 seems to be what I need, but it is critical for me to have a short hash strings.

Is there a shell command or python library which can do that?

Milad Nouri
  • 1,587
  • 1
  • 21
  • 32
Anthony
  • 12,407
  • 12
  • 64
  • 88
  • You mean apart from the standard `md5` module? (Deprecated, though; now you can use `hashlib` instead) – Ricardo Cárdenes Feb 24 '14 at 22:01
  • The question is more about the algorithm rather than implementation – Anthony Feb 24 '14 at 22:09
  • How critical is it for you to have no collisions, and how critical is speed? MD5 is not actually very fast and also not very short, compared to other algorithms. You can calculate the risk of collisions using the Birthday Paradox formula (see Wikipedia). – Thomas Mueller Feb 25 '14 at 08:05

5 Answers5

21

As of Python 3 this method does not work:

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then make it unsigned:

>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value

You can then turn it into a 16 byte hex string:

>>> hashu("dfds").to_bytes(8,"big").hex()

Or an N*2 byte string, where N is <= 8:

>>> hashn=lambda word, N  : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()

..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.

>>> hashnbig=lambda word, N  : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()

And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you

>>> hashnbigu=lambda word, N  : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'

Caveats:

  • Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0

  • See https://github.com/flier/pyfasthash for fast, stable hashes that that similarly won't overload your CPU for non-cryptographic applications.

  • Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants is bad form.

  • In the end 8 bytes of collision resistance is OK for a smaller applications.... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12 byte b64 encoded string. But it might not be enough for larger apps.

  • 16 bytes is enough for a UUID/OID in a cache, etc.

Speed comparison for producing 300k 16 byte hashes from a bytes-input.

builtin: 0.188
md5: 0.359
fnvhash_c: 0.113

For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.

builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
Selali Adobor
  • 2,060
  • 18
  • 30
Erik Aronesty
  • 11,620
  • 5
  • 64
  • 44
  • 31
    In Python 3 this function is randomized and this may be a problem in some cases. – Tim Jul 13 '18 at 13:55
  • 4
    thanks @Tim, from the docs: By default, the __hash__() values of `str`, `bytes` and `datetime` objects are “salted” with an unpredictable random value; set environment variable `PYTHONHASHSEED=0` to disable randomization [...] to allow a cluster of python processes to share hash values. – Rabash Aug 29 '19 at 01:52
  • hash('asd').to_bytes(8, 'little') OverflowError: can't convert negative int to unsigned – iperov Mar 04 '21 at 15:27
  • @iperov fixed ... pass `,signed=True` to the `to_bytes` – Erik Aronesty Mar 10 '21 at 20:00
  • @iperov better to make the hash unsigned. ctypes seems the only clean way. – Erik Aronesty Mar 10 '21 at 20:19
  • What does it mean "won't break your CPU"? Is the described above method dangerous in some way? – Boris Kalinin Feb 07 '22 at 08:18
  • 1
    @BorisKalinin i will clarify above: the method above is great, and is fast. md5 and other cryptographic hashes use too much cpu that's all – Erik Aronesty May 23 '22 at 15:37
13

I guess this question is off-topic, because opinion based, but at least one hint for you, I know the FNV hash because it is used by The Sims 3 to find resources based on their names between the different content packages. They use the 64 bits version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement, if no module satisfies you (pyfasthash has an implementation of it for example).

To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64 bits hash: nsTYVQUag88= (and you can get rid or the padding =).

Edit: I had finally the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144

Cilyan
  • 7,883
  • 1
  • 29
  • 37
4

Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversable.

You are responsible for seeding the library with a secret value and selecting a minimum hash length.

Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.

I have provided more details in this other answer.

Community
  • 1
  • 1
ChrisGPT was on strike
  • 127,765
  • 105
  • 273
  • 257
1

You could use the sum program (assuming you're on linux) but keep in mind that the shorter the hash the more collisions you might have. You can always truncate MD5/SHA hashes as well.

EDIT: Here's a list of hash functions: List of hash functions

eugecm
  • 1,189
  • 8
  • 14
  • 1
    That's covered here: [link](http://stackoverflow.com/questions/8184941/uniform-distribution-of-truncated-md5) – eugecm Feb 24 '14 at 22:10
0

Something to keep in mind is that hash codes are one way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else hash collisions are quite likely and you end up with two hashes both pointing to the same video instead of different ones.

To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.

For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.

Tim B
  • 40,716
  • 16
  • 83
  • 128