4

I have several billion strings in the format word0.word1.word2, and I wish to perform modulo n on those strings so that I can feed each to a database writer for storage. I know I can perform a form a modulo 10 on the first character of the strings like this:

for i in ["a.b","c.d"]: 
    print ord(i[0]) % 10

This won't divide my strings evenly, though, as word0, word1, and word2 are sorted into alphabetical order, and the first character of the string is very often "a". I could take the last letter of the string, but am not sure if those are normally distributed or not.

My question: Is there a fast way to perform something like "ord" on the entire string? I ultimately plan to run modulo 48 on the integer representations of the strings, and wish for that modular output to be uniformly distributed across all 48 cores. I would be grateful for any help others can offer.

duhaime
  • 25,611
  • 17
  • 169
  • 224

1 Answers1

4
s = "whatever"  # have a string
h = hash(s)     # obtain its hash
bin = h % 48    # find the bin

Update: The Python's built-in hash function provides deterministic values only for a single process. If you want to keep this information (directly or indirectly ) in a database you have to use an explicit hash function that doesn't include any random data. (Credit goes to @Alik)

dlask
  • 8,776
  • 1
  • 26
  • 30
  • Thanks @dlask. I realized halfway through typing out the question that I am indeed after a good hash function. I'll accept this answer and fish around for the optimal implementation of `hash()` – duhaime Jul 20 '15 at 12:57
  • This is not a template, it's a working code. The `hash` function is provided directly by Python. However, you can use another hash function, of course. – dlask Jul 20 '15 at 13:14
  • @duhaime please, read carefully documentation on [`__hash__`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) function. The most important part here is the last note. Also even if you use old Python3 versions or Python 2.7 `__hash__` is not guaranteed to be a [deterministic function](https://en.wikipedia.org/wiki/Hash_function#Determinism). – Konstantin Jul 20 '15 at 13:37
  • Thank you @Alik, this is essential. Do you know of a fast deterministic hash function that would help create a fairly uniform distribution appropriate for my %48 situation? Non-deterministic hashes are out of the question. – duhaime Jul 20 '15 at 13:52
  • @duhaime I am not sure, but I've heard good things about djb2 when it all comes to simplicity and speed. – Konstantin Jul 20 '15 at 14:02
  • @duhaime I asked a friend of mine. His recommendation is https://code.google.com/p/xxhash/ – Konstantin Jul 20 '15 at 14:07
  • @Alik, interesting. It's non-cryptographic--is it deterministic? – duhaime Jul 20 '15 at 14:12
  • 1
    @duhaime all hashes must be deterministic functions. Cryptographic hash function is a _hash_ function which is very hard to revert (i.e. get original input given hash value only). Usually they are slower compared to non-cryptographic hashes (see comparison table on xxHash's homepage - MD5, SHA1 are cryptographic hashes) – Konstantin Jul 20 '15 at 14:18