Convert string to random but deterministically repeatable uniform probability

Question

How do I convert a string, e.g. a user ID plus salt, to a random looking but actually a deterministically repeatable uniform probability in the semi-open range [0.0, 1.0)? This means that the output is ≥ 0.0 and < 1.0. The output distribution must be uniform irrespective of the input distribution. For example, if the input string is 'a3b2Foobar', the output probability could repeatably be 0.40341504.

Cross-language and cross-platform algorithmic reproducibility is desirable. I'm inclined to use a hash function unless there is a better way. Here is what I have:

>>> in_str = 'a3b2Foobar'
>>> (int(hashlib.sha256(in_str.encode()).hexdigest(), 16) % 1e8) / 1e8
0.40341504

I'm using the latest stable Python 3. Please note that this question is similar but not exactly identical to the related question to convert an integer to a random but deterministically repeatable choice.

Actually your question is not different from "convert an integer to a random but deterministically repeatable choice" -- once you've converted your string to an integer equivalent via hashing, it is exactly the same problem. — Chris Johnson, Jun 21 '17 at 20:27
@ChrisJohnson Okay, but the answer is meaningfully different. One uses modulus, the other doesn't. Having said that, I think the use of modulus can be entirely avoided in the other answer -- by instead linearly scaling the hash value to the number of available choices. — Asclepius, Jun 23 '17 at 00:27

Asclepius · Accepted Answer · 2021-01-19T20:21:26.153

Using hash

A cryptographic hash is assumably a uniformly distributed integer in the range [0, MAX_HASH]. Accordingly, it can be scaled to a floating-point number in the range [0, 1) by dividing it by MAX_HASH + 1.

import hashlib

Hash = hashlib.sha512
MAX_HASH_PLUS_ONE = 2**(Hash().digest_size * 8)

def str_to_probability(in_str):
    """Return a reproducible uniformly random float in the interval [0, 1) for the given string."""
    seed = in_str.encode()
    hash_digest = Hash(seed).digest()
    hash_int = int.from_bytes(hash_digest, 'big')  # Uses explicit byteorder for system-agnostic reproducibility
    return hash_int / MAX_HASH_PLUS_ONE  # Float division

>>> str_to_probability('a3b2Foobar')
0.3659629991207491

Here is a real world usage example.

Notes:

The built-in hash method must not be used because it can preserve the input's distribution, e.g. with hash(123). Alternatively, it can return values that differ when Python is restarted, e.g. with hash('123').
Using modulo is not necessary as float division is sufficient.

Using random

The random module can be used with in_str as its seed, while addressing concerns surrounding both thread safety and continuity.

With this approach, not only is cross-language reproducibility a concern, but reproducibility across multiple future versions of Python could also be a concern. It is therefore not recommended.

import random

def str_to_probability(in_str):
    """Return a reproducible uniformly random float in the interval [0, 1) for the given seed."""
    return random.Random(in_str).random()

>>> str_to_probability('a3b2Foobar')
0.4662507245848473

I agree with the hashlib solution. Especially since SHA512 is going to be implemented on multiple platforms. SHA is used for encryption, and therefore will be the closest to being random-yet-repeatable you're going to get. Although you could look into other encryption schemes, and most importantly, you should never-ever-ever implement your own. — VoNWooDSoN, Jun 20 '17 at 18:48
This answer assumes Python 3. In python 2 you need to cast one of the inputs to a float to get a float division. — SpliFF, Jun 23 '17 at 01:47

Convert string to random but deterministically repeatable uniform probability

1 Answers1

Using hash

Using random

Linked