26

I'm using hashing of strings for seeding random states in the following way:

import numpy as np
context = "string"
seed = hash(context) % 4294967295  # Keeps the hash within the allowed seed range
np.random.seed(seed)

This is unfortunately (for my usage) non-deterministic between runs in Python 3.3 and up. I do know that I could set the PYTHONHASHSEED environment variable to an integer value to regain the determinism, but I would probably prefer something that feels a bit less hacky, and won't entirely disregard the extra security added by random hashing. Suggestions?

Jimmy C
  • 1
    What is the purpose, though? Why not simply write `seed = 42`, unless you *actually want* the seed to be different on different runs? – Alexey May 01 '20 at 11:23
  • 1
    @Alexey presumably because they actually do want the seed to be different when the context is different, but the same when the context is the same. Here, even if the context is the same, the seed will still be different. – Benjamin Sep 06 '21 at 18:19
  • Related: https://stackoverflow.com/questions/64344515/python-consistent-hash-replacement – Albert Sep 19 '22 at 09:33

3 Answers

12

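Use a purpose-built hash function instead of the built-in hash(). zlib.adler32() is a fast, simple choice that returns a 32-bit integer, though it is a checksum and collides more readily than a real hash; alternatively, check out the hashlib module for stronger options.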
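A minimal sketch of the checksum approach, assuming the context string is encoded as UTF-8 first. The helper name `checksum_seed` is hypothetical; `zlib.crc32()` is shown as an alternative from the same module.

```python
import zlib


def checksum_seed(context: str, use_crc32: bool = False) -> int:
    """Hypothetical helper: map a string to a 32-bit integer seed."""
    data = context.encode("utf-8")  # zlib checksums operate on bytes
    # On Python 3, both functions return an unsigned 32-bit integer,
    # and the result does not depend on PYTHONHASHSEED.
    return zlib.crc32(data) if use_crc32 else zlib.adler32(data)


adler_seed = checksum_seed("string")
crc_seed = checksum_seed("string", use_crc32=True)
```

Both values stay the same across interpreter runs, so either can replace `hash(context) % 4294967295` in the question.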

  • 20
    Watch out! I found out the hard way that adler32 is not meant for hashing but for error detection (it is a checksum). It has a rather high collision probability. Quite a headache to debug. – user2647513 Feb 12 '20 at 16:34
6

Forcing Python's built-in hash to be deterministic is intrinsically hacky. If you want to avoid hackitude, use a different hashing function; see, e.g., hashlib for Python 2: https://docs.python.org/2/library/hashlib.html, and for Python 3: https://docs.python.org/3/library/hashlib.html
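One way this could look with hashlib, as a sketch rather than the only option: hash the UTF-8 bytes with SHA-256 and keep the first four bytes as an integer. The helper name `stable_seed`, the choice of SHA-256, and the four-byte truncation are all assumptions made for illustration.

```python
import hashlib
import random


def stable_seed(context: str) -> int:
    """Hypothetical helper: derive a 32-bit seed from a string via SHA-256."""
    digest = hashlib.sha256(context.encode("utf-8")).digest()
    # Keep the first 4 bytes so the result lies in [0, 2**32 - 1].
    return int.from_bytes(digest[:4], "big")


seed = stable_seed("string")
rng = random.Random(seed)  # stdlib generator, used here to keep the sketch dependency-free
sample = [rng.randrange(10) for _ in range(5)]
```

The resulting integer is within the 0 to 2**32 - 1 range that `np.random.seed` accepts, so `np.random.seed(stable_seed(context))` should work as a drop-in for the snippet in the question without touching PYTHONHASHSEED.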

Mazdak
Alex Martelli
  • 4
    Isn't a hash supposed to be deterministic? – nicolas Nov 19 '19 at 15:36
  • 4
    hash() is only deterministic _throughout the same run_; you have no guarantee it will return the same hash in different runs. Hence it's bad for persistence on disk. – Le Frite Dec 30 '19 at 15:45
4

You can actually use a string as seed for random.Random:

>>> import random
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]

It can be convenient, e.g. to use the basename of an input file as seed. For the same input file, the generated numbers will always be the same.
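The basename idea could be sketched as follows. The helper name `rng_for_file` and the example paths are hypothetical, and the files do not need to exist because only the path string is used.

```python
import os
import random


def rng_for_file(path: str) -> random.Random:
    """Hypothetical helper: build a generator seeded by the file's basename."""
    name = os.path.basename(path)  # e.g. "input.csv"
    return random.Random(name)     # a str is accepted directly as a seed


# Two different directories, same basename: the generators start identically.
first = [rng_for_file("data/run1/input.csv").randrange(10) for _ in range(1)]
a = rng_for_file("data/run1/input.csv")
b = rng_for_file("other/dir/input.csv")
draws_a = [a.randrange(10) for _ in range(20)]
draws_b = [b.randrange(10) for _ in range(20)]
```

Note that this seeds a `random.Random` instance rather than NumPy's global state, so it fits best when the standard library generator is enough.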

Eric Duminil