1

I want to replace an existing random number based data generator (in Python) with a hash based one so that it no longer needs to generate everything in sequence, as inspired by this article.

I can create a float from 0 to 1 by taking the integer version of the hash and dividing it by the maximum value of a hash.

I can create a flat integer range by taking the float and multiplying by the flat range. I could probably use modulo and live with the bias, as the hash range is large and my flat ranges are small.

How could I use the hash to create a gaussian or normal distributed floating point value?

For all of these cases, would I be better off just using my hash as a seed for a new random.Random object and using the functions in that class to generate my numbers and rely on them to get the distribution characteristics right?

At the moment, my code is structured like this:

num_people = randint(1,100)
people = [dict() for x in range(num_people)]
for person in people:
    person['surname'] = choice(surname_list)
    person['forename'] = choice(forename_list)

The problem is that for a given seed to be consistent, I have to generate all the people in the same order, and I have to generate the surname then the forename. If I add a middle name in between the two then the generated forenames will change, as will all the names of all the subsequent people.

I want to structure the code like this:

h1_groupseed=1

h2_peoplecount=1
h2_people=2

h4_surname=1
h4_forename=2

num_people = pghash([h1_groupseed,h2_peoplecount]).hashint(1,100)
people = [dict() for x in range(num_people)]
for h3_index, person in enumerate(people,1):
    person['surname'] = surname_list[pghash([h1_groupseed,h2_people,h3_index,h4_surname]).hashint(0, num_of_surnames - 1)]
    person['forename'] = forename_list[pghash([h1_groupseed,h2_people,h3_index,h4_forename]).hashint(0, num_of_forenames - 1)]

This would use the values passed to pghash to generate a hash, and use that hash to somehow create the pseudorandom result.

PhilHibbs
  • 859
  • 1
  • 13
  • 30
  • Why do you want to do this? – Reblochon Masque Oct 19 '17 at 13:54
  • You can use Box Muller transformation to change uniformly distributed variables into normal ones. https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform – WNG Oct 19 '17 at 13:55
  • @ReblochonMasque Because I want to make the data generator robust against changes to the order of generation of attributes. – PhilHibbs Oct 19 '17 at 14:28
  • Still unclear on the why (and i can't map the first sentence to the last comment on the why; let alone understand any of those). This might also be more work than you want to do if doing it right (bits -> floats). – sascha Oct 19 '17 at 15:28
  • Could you maybe show some simple code demonstrating how you'd use this functionality? – Mark Dickinson Oct 19 '17 at 15:56
  • Possible duplicate of [Converting a Uniform Distribution to a Normal Distribution](https://stackoverflow.com/questions/75677/converting-a-uniform-distribution-to-a-normal-distribution) – pjs Oct 19 '17 at 19:11
  • I'd feel a lot less dubious about the article that inspired you if it were a peer reviewed journal publication rather than a blog post. While both can contain incorrect info, the odds that are greatly reduced by the peer review publication process. – pjs Oct 20 '17 at 15:01
  • So if I have a 64-bit hash that is my "random number", I suppose I could split that into two halves, convert each 32-bit half to a number, and then use that pair of numbers in the Box-Muller transform. – PhilHibbs Oct 23 '17 at 12:38

3 Answers3

1

First, a big caveat: DO NOT ROLL YOUR OWN CRYPTO. If you're trying to do this for security purposes, DON'T.

Next, check out this question which lists several ways to do what you want, i.e. transform a random uniform variable into a normal one: Converting a Uniform Distribution to a Normal Distribution

pills
  • 656
  • 1
  • 5
  • 10
1

Unless you're doing this for your own amusement or as a learning exercise, my very strong advice is don't do this.

PRNGs have the same general structure, even if the details are wildly different. They map a seed value s into an initial state S via some function f: S←f(s); they then iterate states via some transformation h: Si+1←h(Si); and finally they map the state to an output U via some function g: Ui←g(Si). (For simple PRNGs, f() or g() are often identity functions. For more sophisticated generators such as Mersenne Twister, more is involved.)

The state transition function h() is designed to distribute new states uniformly across the state space. In other words, it's already a hash function, but with the added benefit that for any widely accepted generator it has been heavily vetted by experts to have good statistical behavior.

Mersenne Twister, Python's default PRNG, has been mathematically proven to have k-tuples be jointly uniformly distributed for all k ≤ 623. I'm guessing that whatever hash function you choose can't make such claims. Additionally, the collapsing function g() should preserve uniformity in the outcomes. You've proposed that you "can use the integer version of the hash to create a flat number range, just by taking the modulus." In general this will introduce modulo bias, so you won't end up with a uniformly distributed result.

If you stick with the built-in PRNG, there's no reason not to use the built-in Gaussian generator. If you want to do it for your own amusement there are lots of resources that will tell you how to map uniforms to Gaussians. Well-known methods include the Box-Muller method, Marsaglia's polar method, and the ziggurat method.


UPDATE

Given the additional information you've provided in your question, I think the answer you want is contained in this section of Python's documentation for random:

The functions supplied by this module are actually bound methods of a hidden instance of the random.Random class. You can instantiate your own instances of Random to get generators that don’t share state. This is especially useful for multi-threaded programs, creating a different instance of Random for each thread, and using the jumpahead() method to make it likely that the generated sequences seen by each thread don’t overlap.

Sounds like you want separate instances of Random for each person, seeded independently of each other or with synchronized but widely separated states as described in the random.jumpahead() documentation. This is one of the approaches that simulation modelers have used since the early 1950's so they can maintain repeatability between configurations to make direct comparisons of two or more systems in a fair fashion. Check out the discussion of "synchronization" on the second page of this article, or starting on page 8 of this book chapter, or pick up any of the dozens of simulation textbooks available in most university libraries and read the sections on "common random numbers." (I'm not pointing you towards Wikipedia because it provides almost no details on this topic.)

Here's an explicit example showing creating multiple instances of Random:

import random as rnd

print("two PRNG instances with identical seeding produce identical results:")
r1 = rnd.Random(12345)
r2 = rnd.Random(12345)
for _ in range(5):
    print([r1.normalvariate(0, 1), r2.normalvariate(0, 1)])

print("\ndifferent seeding yields distinct but reproducible results:")
r1 = rnd.Random(12345)
r2 = rnd.Random(67890)
for _ in range(3):
    print([r1.normalvariate(0, 1), r2.normalvariate(0, 1)])
print("\nresetting, different order of operations")
r1 = rnd.Random(12345)
r2 = rnd.Random(67890)
print("r1: ", [r1.normalvariate(0, 1) for _ in range(3)])
print("r2: ", [r2.normalvariate(0, 1) for _ in range(3)])
pjs
  • 18,696
  • 4
  • 27
  • 56
  • So I should use the built in random module, but use the hash as a fresh seed every time? That makes sense. I hope the cost of constructing a new Random instance every time isn't too great. – PhilHibbs Oct 20 '17 at 09:49
  • @PhilHibbs No! The good distributional properties come from h() and g() transformations built into the PRNG, not from seeding. Seeding is expensive for Mersenne Twister, and doing it repeatedly can actually harm the distributional properties the PRNG designers worked so hard to give you. (Do a search for all of the "why does random keep giving the same value" type posts here on SO.) Don't reseed, unless you REALLY REALLY know what you're doing and have a very good reason to do so. – pjs Oct 20 '17 at 13:28
  • 1
    My ["good reason"](https://blogs.unity3d.com/2015/01/07/a-primer-on-repeatable-random-numbers/) is that I don't want to have to generate all the data in exactly the same sequence every time. Looking at my example code that I just added - how would you add a "middle name" attribute without all the subsequent people having totally different names for a given random seed? – PhilHibbs Oct 20 '17 at 14:04
  • The way that I've handled this in the past is, every time I add some new attributes I use getstate/jumpahead/setstate on the Random object so as not to disturb the subsequent random values. – PhilHibbs Oct 20 '17 at 14:11
  • For anyone that comes across this, completely ignore this post, it's making recommendations with no consideration of context. Super high quality PRNGs are pretty terrible for procedural generation like this and trying to use them will introduce more problems than it solves. Look up how to use something like xxHash instead. – JamEngulfer Oct 09 '21 at 22:22
0

I have gone ahead and created a simple hash-based replacement for some of the functions in the random.Random class:

from __future__ import division
import xxhash
from numpy import sqrt, log, sin, cos, pi

def gaussian(u1, u2):
    z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
    z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
    return z1,z2

class pghash:
    def __init__(self, tuple, seed=0, sep=','):
        self.hex = xxhash.xxh64(sep.join(tuple), seed=seed).hexdigest()

    def pgvalue(self):
        return int(self.hex, 16)

    def pghalves(self):
        return self.hex[:8], self.hex[8:]

    def pgvalues(self):
        return int(self.hex[:8], 16), int(self.hex[8:], 16)

    def random(self):
        return self.value() / 2**64

    def randint(self, min, max):
        return int(self.random() * max + min)

    def gauss(self, mu, sigma):
        xx = self.pgvalues()
        uu = [xx[0]/2**32, xx[1]/2**32]
        return gaussian(uu[0],uu[1])[0]

Next step is to go through my code and replace all the calls to random.Random methods with pghash objects.

I have made this into a module, which I hope to upload to pypi at some point: https://github.com/UKHomeOffice/python-pghash

PhilHibbs
  • 859
  • 1
  • 13
  • 30