Python: Performing Modulo on Strings

Question

I have several billion strings in the format word0.word1.word2, and I wish to perform modulo n on those strings so that I can feed each to a database writer for storage. I know I can perform a form a modulo 10 on the first character of the strings like this:

for i in ["a.b","c.d"]: 
    print ord(i[0]) % 10

This won't divide my strings evenly, though, as word0, word1, and word2 are sorted into alphabetical order, and the first character of the string is very often "a". I could take the last letter of the string, but am not sure if those are normally distributed or not.

My question: Is there a fast way to perform something like "ord" on the entire string? I ultimately plan to run modulo 48 on the integer representations of the strings, and wish for that modular output to be uniformly distributed across all 48 cores. I would be grateful for any help others can offer.

So basically you are looking for a good [hash function](https://en.wikipedia.org/wiki/Hash_function) I think your best bet would be to take an existing implementation — Konstantin, Jul 20 '15 at 12:10
http://stackoverflow.com/questions/5297448/how-to-get-md5-sum-of-a-string — boardrider, Jul 20 '15 at 12:14
@boardrider md5 is a cryptographic hash function, not the one you want to use in hashmap-like structures. — Konstantin, Jul 20 '15 at 12:18

dlask · Accepted Answer · 2015-07-20T13:49:47.847

4

s = "whatever"  # have a string
h = hash(s)     # obtain its hash
bin = h % 48    # find the bin

Update: The Python's built-in hash function provides deterministic values only for a single process. If you want to keep this information (directly or indirectly ) in a database you have to use an explicit hash function that doesn't include any random data. (Credit goes to @Alik)

edited Jul 20 '15 at 13:49

answered Jul 20 '15 at 12:12

dlask

8,776
1
26
30

Thanks @dlask. I realized halfway through typing out the question that I am indeed after a good hash function. I'll accept this answer and fish around for the optimal implementation of `hash()` – duhaime Jul 20 '15 at 12:57
This is not a template, it's a working code. The `hash` function is provided directly by Python. However, you can use another hash function, of course. – dlask Jul 20 '15 at 13:14
@duhaime please, read carefully documentation on [`__hash__`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) function. The most important part here is the last note. Also even if you use old Python3 versions or Python 2.7 `__hash__` is not guaranteed to be a [deterministic function](https://en.wikipedia.org/wiki/Hash_function#Determinism). – Konstantin Jul 20 '15 at 13:37
Thank you @Alik, this is essential. Do you know of a fast deterministic hash function that would help create a fairly uniform distribution appropriate for my %48 situation? Non-deterministic hashes are out of the question. – duhaime Jul 20 '15 at 13:52
@duhaime I am not sure, but I've heard good things about djb2 when it all comes to simplicity and speed. – Konstantin Jul 20 '15 at 14:02
@duhaime I asked a friend of mine. His recommendation is https://code.google.com/p/xxhash/ – Konstantin Jul 20 '15 at 14:07
@Alik, interesting. It's non-cryptographic--is it deterministic? – duhaime Jul 20 '15 at 14:12
1

@duhaime all hashes must be deterministic functions. Cryptographic hash function is a _hash_ function which is very hard to revert (i.e. get original input given hash value only). Usually they are slower compared to non-cryptographic hashes (see comparison table on xxHash's homepage - MD5, SHA1 are cryptographic hashes) – Konstantin Jul 20 '15 at 14:18

Python: Performing Modulo on Strings

1 Answers1