How to hash a string into 8 digits?

Question

Is there any way that I can hash a random string into an 8-digit number without implementing any algorithms myself?

8 digits seems too small, and may result in hash collisions if you have a large number of records. http://stackoverflow.com/questions/1303021/shortest-hash-in-python-to-name-cache-files — DhruvPathak, Apr 15 '13 at 06:19
Any finite number of digits will result in collisions for sufficiently large numbers of hash items, that's why you shouldn't treat them as unique keys - it tends to turn into the birthday problem. — Alex North-Keys, May 17 '17 at 22:27
I've chosen "CityHash" to hash strings to 19 digit long integers (64bit integers), hoping this will lead to less potential collisions than Raymond's suggestion below. https://en.wikipedia.org/wiki/List_of_hash_functions — tryptofame, Jul 21 '17 at 13:05
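
The birthday-problem point raised in the comments above can be made concrete. The sketch below uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n uniformly distributed hashes over N = 10**8 possible values. The function name `collision_probability` is an illustrative choice, not something from the thread.

```python
import math

def collision_probability(n_items, n_buckets=10**8):
    """Approximate chance that at least two of n_items share a hash value,
    assuming hashes are uniformly distributed over n_buckets values."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * n_buckets))

# Print the estimate for a few record counts.
for n in (1_000, 10_000, 100_000):
    print(n, round(collision_probability(n), 4))
```

Under that uniformity assumption, the estimate is roughly 0.5% for 1,000 items, about 39% for 10,000, and close to certain for 100,000.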

score 249 · Accepted Answer · edited Nov 30 '20 at 20:27


Yes, you can use the built-in hashlib module or the built-in hash function. Then keep only the last eight digits, using a modulo operation or string slicing on the integer form of the hash:

>>> s = 'she sells sea shells by the sea shore'

>>> # Use hashlib
>>> import hashlib
>>> int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16) % (10 ** 8)
58097614

>>> # Use hash()
>>> abs(hash(s)) % (10 ** 8)
82148974

edited Nov 30 '20 at 20:27 by Boris Verkhovskiy
answered Apr 15 '13 at 06:17 by Raymond Hettinger

public service announcement...this technique doesn't actually result in a unique hash value for the string; it computes a hash and then munges it into a non-guaranteed-unique value – twneale Sep 18 '15 at 15:03
public service announcement...except for the special case of perfect hashes over a limited set of input values, hash functions aren't supposed to generate guaranteed unique values. – Raymond Hettinger Sep 19 '15 at 15:39
Probably true, but virtually all of their practical utility derives from their good-enough tendency to produce unique values. The probability of a 'hash' collision using this trick is probably 10 or 11 orders of magnitude higher than md5 – twneale Sep 20 '15 at 23:04
Did you read the OP's question? He (or she) wanted (or needed) 8 decimal places. Also, the way hash tables work is to hash into a small search space (the sparse table). You seem to not know what hash functions are commonly used for and to not care about the actual question that was asked. – Raymond Hettinger Sep 21 '15 at 03:19
I read the question. I'm simply observing that over the same input space as SHA-1, your answer is astronomically more likely to produce a collision than not. At least some degree of uniqueness is implicitly required by the question, but your answer is a hash function in the same spirit as one that simply returns 12345678 for every input. I was able to experimentally generate a collision with as few as 1000 inputs using this method. To preserve the same collision probability as SHA-1, you would have to map un-truncated SHA-1's to 8-digit integers. I think that's worthy of a PSA – twneale Sep 21 '15 at 15:58
Careful, hash(s) is not guaranteed to give the same results across platforms and runs. – Mr. Napik Feb 16 '16 at 21:33
Is `abs` needed? Modulo should return a positive int. – Doug Apr 29 '16 at 00:02
In Python2, ``hash('agir')`` is ``-2835743962885600615``. – Raymond Hettinger Apr 29 '16 at 18:13
Right, but what I think Doug meant is that even a negative number mod something will always produce a positive number, so it seems you can drop the abs(). Also, I think the relative operator precedence of exponentiation means we don't even need the second parens. Thanks for the answer, though! >>> hash(s) % 10**8 produces 57227199 – JJC Feb 07 '17 at 10:41
An important caveat is that, unlike with Python 2.x, hash(x) returns a different value on each Python 3.x interpreter invocation (it is consistent within a single process). So, if the OP is depending on the hash to be the same for a given string across script runs, the latter will not work in Python 3.x. This just bit me. I will add an answer to reflect these two comments (not yet sure about etiquette of editing). – JJC Feb 07 '17 at 11:37
You should use `1e8` instead of `10**8`; you're performing an extra computation when there is absolutely no need. Also, nice answer, it's exactly what I was looking for. – silgon Nov 02 '18 at 16:24
@silgon Python's peephole optimizer does constant folding, so the computation is only done once. That is easy to verify. Run ``from dis import dis; dis(compile('10 ** 8', '', 'eval'))`` and look for the fragment ``LOAD_CONST 0 (100000000)``. Alternatively, run ``def f(): return 10**8`` and observe that ``f.__code__.co_consts`` returns ``(None, 100000000)``. Note that ``1e8`` isn't a valid substitute because that is a *float* rather than an *int*. – Raymond Hettinger Nov 02 '18 at 22:05
Wow... I just checked what you said, you're right, and it's really interesting, I thought that the power operation `**` would always run an operation, however as you said, it's `LOAD_CONST`. Thanks for the interesting reply. Also, you're right, the scientific notation `1e8` gives a float. – silgon Nov 03 '18 at 09:24
Some of the comments regarding 'unique value' are confusing. I am trying to do the same thing, and tested in Python 3.7.4 and 3.5.3 on two different machines. For the same input string, the results are the same. Is it true that the same input string always results in the same output for `hashlib.sha1`? – user1783732 Aug 10 '19 at 00:04
If your extracted 8 digits start with a 0, you'll end up with a 7 digit number. – ingo Apr 02 '20 at 08:55
I think it is worth mentioning that if you want a stable hash you should use the `hashlib` solution. – kaptan Sep 24 '21 at 21:20
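
Two points from the comments above (a leading zero shortens the number, and only hashlib is stable across runs) can be handled together. The following is a minimal sketch, assuming the result is wanted as a fixed-width string. `stable_hash8` is a hypothetical helper name, not part of hashlib.

```python
import hashlib

def stable_hash8(text):
    """Map text to a zero-padded 8-character decimal string via SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    number = int(digest, 16) % 10**8   # always in the range 0..99_999_999
    return f"{number:08d}"             # pad so small values keep 8 digits

print(stable_hash8("she sells sea shells by the sea shore"))
```

Returning a string rather than an int is a design choice: an int cannot carry leading zeros, so values below 10,000,000 would otherwise print with fewer than 8 digits.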

score 164 · Answer 2 · answered Feb 07 '17 at 11:57

Raymond's answer is great for Python 2 (though you need neither the abs() nor the parentheses around 10 ** 8). However, for Python 3, there are important caveats. First, you'll need to make sure you are passing an encoded string (bytes). These days, in most circumstances, it's probably also better to shy away from SHA-1 and use something like SHA-256 instead. So, the hashlib approach would be:

>>> import hashlib
>>> s = 'your string'
>>> int(hashlib.sha256(s.encode('utf-8')).hexdigest(), 16) % 10**8
80262417

If you want to use the hash() function instead, the important caveat is that, unlike in Python 2.x, in Python 3.x the result of hash() is only consistent within a process, not across Python invocations. See here:

$ python -V
Python 2.7.5
$ python -c 'print(hash("foo"))'
-4177197833195190597
$ python -c 'print(hash("foo"))'
-4177197833195190597

$ python3 -V
Python 3.4.2
$ python3 -c 'print(hash("foo"))'
5790391865899772265
$ python3 -c 'print(hash("foo"))'
-8152690834165248934

This means the hash()-based solution suggested, which can be shortened to just:

hash(s) % 10**8

will only return the same value within a given script run:

#Python 2:
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543

#Python 3:
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
12954124
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
32065451

So, depending on whether this matters in your application (it did in mine), you'll probably want to stick to the hashlib-based approach.

It should be noted that this answer has a very important caveat: since Python 3.3, to protect against tar-pitting, Python uses a random hash seed upon startup. — Wolph, Jan 06 '18 at 13:20
If digits are not your main requirement, you could also use `hashlib.sha256("hello world".encode('utf-8')).hexdigest()[:8]`, which will still have collisions — lony, Dec 17 '18 at 16:41
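
The per-process hash seed described above can be observed, and pinned, from a script. This sketch relies on CPython's `PYTHONHASHSEED` environment variable, which is not mentioned in the thread itself. `hash_in_new_process` is a hypothetical helper that starts a fresh interpreter for each call.

```python
import os
import subprocess
import sys

def hash_in_new_process(seed):
    """Run hash("foo") in a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("foo"))'],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# A fixed seed gives the same value in every new process.
print(hash_in_new_process(0) == hash_in_new_process(0))
# "random" is the default behaviour: values normally differ between runs.
print(hash_in_new_process("random"), hash_in_new_process("random"))
```

Pinning the seed makes hash() repeatable on one machine, but it also switches off the protection the randomization was added for, so the hashlib approach is likely still the safer choice for persistent identifiers.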

score 11 · Answer 3 · answered Nov 15 '17 at 22:02

Just to complete JJC's answer: in Python 3.5.3 the output is the same across runs if you use hashlib this way:

$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded
$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded

$ python3 -V
Python 3.5.3

score 8 · Answer 4 · answered Oct 13 '22 at 08:32

As of Python 3.10, another quick way of hashing a string to an 8-hexadecimal-digit digest is to use hashlib.shake_128(...).hexdigest(4):

import hashlib
h = hashlib.shake_128(b"my ascii string").hexdigest(4)
# 34c0150b

Note the 4 rather than 8: the argument is the digest length in bytes, and each byte becomes two hexadecimal digits.

Of course be aware of hash collisions.
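
Since the question asks for decimal digits, note that an 8-hex-digit digest is not an 8-digit number: 4 bytes can encode values up to 2**32 - 1, which has 10 decimal digits. A minimal sketch of bridging the two, reusing the modulo step from the earlier answers (the variable names are illustrative):

```python
import hashlib

hex_digest = hashlib.shake_128(b"my ascii string").hexdigest(4)
assert len(hex_digest) == 8          # 4 bytes -> 8 hex characters

as_int = int(hex_digest, 16)         # anywhere from 0 to 2**32 - 1
eight_digits = as_int % 10**8        # reduce to at most 8 decimal digits
print(hex_digest, as_int, f"{eight_digits:08d}")
```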
