11

How do I convert a string, e.g. a user ID plus salt, to a random looking but actually a deterministically repeatable uniform probability in the semi-open range [0.0, 1.0)? This means that the output is ≥ 0.0 and < 1.0. The output distribution must be uniform irrespective of the input distribution. For example, if the input string is 'a3b2Foobar', the output probability could repeatably be 0.40341504.

Cross-language and cross-platform algorithmic reproducibility is desirable. I'm inclined to use a hash function unless there is a better way. Here is what I have:

>>> in_str = 'a3b2Foobar'
>>> (int(hashlib.sha256(in_str.encode()).hexdigest(), 16) % 1e8) / 1e8
0.40341504

I'm using the latest stable Python 3. Please note that this question is similar but not exactly identical to the related question to convert an integer to a random but deterministically repeatable choice.

Asclepius
  • 57,944
  • 17
  • 167
  • 143
  • Actually your question is not different from "convert an integer to a random but deterministically repeatable choice" -- once you've converted your string to an integer equivalent via hashing, it is exactly the same problem. – Chris Johnson Jun 21 '17 at 20:27
  • @ChrisJohnson Okay, but the answer is meaningfully different. One uses modulus, the other doesn't. Having said that, I think the use of modulus can be entirely avoided in the other answer -- by instead linearly scaling the hash value to the number of available choices. – Asclepius Jun 23 '17 at 00:27

1 Answers1

22

Using hash

A cryptographic hash is assumably a uniformly distributed integer in the range [0, MAX_HASH]. Accordingly, it can be scaled to a floating-point number in the range [0, 1) by dividing it by MAX_HASH + 1.

import hashlib

Hash = hashlib.sha512
MAX_HASH_PLUS_ONE = 2**(Hash().digest_size * 8)

def str_to_probability(in_str):
    """Return a reproducible uniformly random float in the interval [0, 1) for the given string."""
    seed = in_str.encode()
    hash_digest = Hash(seed).digest()
    hash_int = int.from_bytes(hash_digest, 'big')  # Uses explicit byteorder for system-agnostic reproducibility
    return hash_int / MAX_HASH_PLUS_ONE  # Float division

>>> str_to_probability('a3b2Foobar')
0.3659629991207491

Here is a real world usage example.

Notes:

  • The built-in hash method must not be used because it can preserve the input's distribution, e.g. with hash(123). Alternatively, it can return values that differ when Python is restarted, e.g. with hash('123').
  • Using modulo is not necessary as float division is sufficient.

Using random

The random module can be used with in_str as its seed, while addressing concerns surrounding both thread safety and continuity.

With this approach, not only is cross-language reproducibility a concern, but reproducibility across multiple future versions of Python could also be a concern. It is therefore not recommended.

import random

def str_to_probability(in_str):
    """Return a reproducible uniformly random float in the interval [0, 1) for the given seed."""
    return random.Random(in_str).random()

>>> str_to_probability('a3b2Foobar')
0.4662507245848473
Asclepius
  • 57,944
  • 17
  • 167
  • 143
  • 3
    I agree with the hashlib solution. Especially since SHA512 is going to be implemented on multiple platforms. SHA is used for encryption, and therefore will be the closest to being random-yet-repeatable you're going to get. Although you could look into other encryption schemes, and most importantly, you should never-ever-ever implement your own. – VoNWooDSoN Jun 20 '17 at 18:48
  • 1
    This answer assumes Python 3. In python 2 you need to cast one of the inputs to a float to get a float division. – SpliFF Jun 23 '17 at 01:47