
I am in a situation where I need to generate a very large (~10^16 elements) random matrix with a certain random sparsity pattern. Obviously storing all of these elements is completely infeasible. Only a few of the elements are needed at any given time, so they can be drawn on-demand - however, once an element is drawn, it may be needed later, and it's important that the same value is used. That is, elements cannot be thrown out and randomly redrawn - once a random element is drawn, it needs to be saved.

Based on the problem itself, there are potentially some clever ways to handle this that I won't get into. However, a colleague says that it should be possible to deterministically generate these random numbers as needed by using a pseudorandom number generator with seed given by the index in the matrix, i.e. use i + N*j for element (i, j) of the matrix where the matrix is size N*N. This would not be calling the rand() function, but using the underlying pseudorandom function with specific arguments to deterministically generate previously drawn random values. Then no numbers would need to be saved, and they could be deterministically redrawn on demand.
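For concreteness, a rough sketch of what I understand the suggestion to be (the function name and the choice of NumPy's default_rng are purely for illustration):

import numpy as np

def matrix_element(i, j, N):
    # re-seed a fresh generator with the linear index of (i, j) and take the
    # first draw; calling again with the same (i, j) reproduces the same value
    rng = np.random.default_rng(i + N * j)
    return rng.random()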

My understanding of PRNGs was that for a sequence of numbers to appear random, you must fix the seed. Does the above approach make sense? It seems to me to be like repeatedly re-seeding the PRNG and just taking the first element.

The Wind-Up Bird
    Why don't you just try? It looks simple enough. – Stop harming Monica Dec 06 '17 at 17:49
  • This sounds very broad and unclear. The most important question for me is: what do you want to do with this matrix? What operations are needed (indexing/lookup only; arithmetic)? Think in terms of the usual sparse-matrix formats like lil, dok, coo, csr and co. which are different in terms of allowed (or let's call it: efficient) operations and also in terms of handling duplicate values. – sascha Dec 06 '17 at 18:13
  • Your use case is so specialized that, if it were me, I'd design my own element generator. Your last paragraph sums up the problem: designers expect their RNGs to be used by taking a long stream of numbers after seeding, not many seeds and one number for each seed. – President James K. Polk Dec 06 '17 at 18:59
  • Use a hash for this. Seeding a random generator based on the matrix index, then getting a random number is a bad idea (even if it is not the first random number, but the 1000th). Usually, a random generator produces numbers which have a strong correlation to the initial seed. – geza Dec 06 '17 at 19:34

2 Answers


Not a precise answer, but some tries.

A hash function seems to be a simple and efficient means to achieve your goal.

Here are some good ideas about integer to integer hash-functions.

From this article, I tried this:

from numba import uint64, njit
import numpy as np
import pylab as pl

@njit(uint64(uint64, uint64))
def hash64(i, j):
    # pack (i, j) into one 64-bit word, then apply the splitmix64 finalizer
    x = i + (j << 32)
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb
    x = x ^ (x >> 31)
    return x

n=1000    
im=[[hash64(i,j) for i in range(n)] for j in range(n)]
pl.subplot(121)
pl.imshow(im)
pl.colorbar()
pl.subplot(122)
pl.hist(np.array(im).ravel(),bins=100)
pl.show()  

This numba hash64 function computes a hash code in ~200 ns.

And this plot (even if it demonstrates nothing rigorous) suggests that this function could be a good candidate.

[plot: image of hash64(i, j) over a 1000x1000 grid, and histogram of its values]

By contrast, the Python built-in hash function (hash((i, j)) on a tuple) does not pass the test:

[plot: image and histogram for hash((i, j))]
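The test is the same plotting code as above with hash64 simply replaced by the built-in hash (a reconstruction, reusing n, pl and np from the block above):

im = [[hash((i, j)) for i in range(n)] for j in range(n)]
pl.subplot(121)
pl.imshow(im)
pl.colorbar()
pl.subplot(122)
pl.hist(np.array(im).ravel(), bins=100)
pl.show()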

Here is the same test for a Knuth generator:

[plot: image and histogram for the Knuth generator]

And some benchmarks:

In [61]: %timeit hash64(0,1)
215 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [62]: %timeit np.random.seed(0+1<<30);a=np.random.randint(2**30)
4.18 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [63]: %timeit hash((0,1))
102 ns ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each) 
B. M.
  • There is no link to sparse matrices. Let's assume we want to do one of the easiest tasks involving sparse matrices: get all nonzeros (in time linear in #nnz): how would you do that? (an outer random binary decision index-wise might generate a sparse matrix, but finding the nnzs is then as costly as trying all n*m indices). – sascha Dec 06 '17 at 19:30
  • @sascha perhaps it was a bit misleading when I called it a "sparse matrix", because it implies I want to use this matrix to do typical sparse matrix operations. It is a matrix, formally, but the only time that it ever appears in my problem is as a quadratic form with a vector, and only parts of that quadratic form ever need to be computed at a time. So, while it is a sparse matrix, any typical matrix operation is unnecessary (no multiplication). I just need to be able to look up the elements, and element (i, j) has a meaning in the problem. – The Wind-Up Bird Dec 07 '17 at 14:45
  • Is this hash pseudorandom, though? In the sense that I could dictate the distribution from which the elements appear to be drawn? – The Wind-Up Bird Dec 07 '17 at 14:49
  • @TheWind-UpBird Sure. But that also implies the need for some operations. Imagine a coo-format. Probably the easiest for your task of non-explicit PRNG-like storage. In this quad-form, you will need to query the matrix what entries are within this row/column (in linear time of nnzs in this row/column only). Which is quite a bad thing for coo for example. It does not allow efficient indexing (will need going through all nnzs of the whole matrix; even when looking for index i,v). (calculating some distribution-based positions/values of elements seems like the least of all problems to me) – sascha Dec 07 '17 at 14:57
  • @TheWind-UpBird And in regards to your pseudorandom-question: it's unclear what you want. I never saw some distribution-model. The basic concept is: with access to uniform numbers in [0,1) (which a good hash provides; in best-case a crypto-hash which is slow), you can always generate any distribution, at the cost of using more than one uniform number (e.g. wiki's article of sampling gaussians for a specialized variant; inverse-transform sampling for a general slower approach given a cdf). – sascha Dec 07 '17 at 15:03
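To make that last point concrete, here is a minimal sketch (an illustrative addition, not part of the original answer) that turns the hash64 output from the answer above into a uniform value in [0, 1) and then, via inverse-transform sampling, into a normally distributed one; scipy.stats.norm.ppf is just one possible inverse CDF:

import numpy as np
from scipy.stats import norm

def uniform01(i, j):
    # map the 64-bit hash to a float in [0, 1)
    return float(hash64(i, j)) * 2.0**-64

def gaussian_element(i, j):
    # inverse-transform sampling: push the uniform draw through the inverse
    # CDF of the target distribution (standard normal here); a draw of
    # exactly 0 would map to -inf, so real code may want to guard that case
    return norm.ppf(uniform01(i, j))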

Basically, you want 10^16 ≈ 2^53 random numbers, each uniquely determined by the matrix element's linear index.

The simplest suitable random number generator would be a 64-bit linear congruential generator (LCG). Basically, any reasonable one with 63-64 bits of state and a full period will work. The code below implements Knuth's MMIX LCG with the seed equal to the matrix element's linear index. Another good property of the LCG is jump-ahead with logarithmic complexity: advancing by N positions takes ~log2(N) operations.

Code

import numpy as np

A = np.uint64(6364136223846793005) # multiplier
C = np.uint64(1442695040888963407) # increment
K = np.float64(5.421010862427522170037E-20) # 1/2^64

def K64(seed):
    """
    Knuth 64bit MMIX LCG, return uint64 in the range [0...2^64)
    """
    return A*np.uint64(seed) + C

def KLCG64(seed):
    """
    Knuth 64bit MMIX LCG, return float64 in the range [0.0...1.0)
    """
    #return np.float64(K64(seed))*K # here is proper implementation, but slower than inlined code below
    return np.float64(A*np.uint64(seed) + C)*K

def randomMat(i, j, N):
    """
    return random element of the matrix
    """
    return KLCG64(np.uint64(i) + np.uint64(j)*np.uint64(N))

np.seterr(over='ignore') # important! LCG relies on unsigned overflow for mod operation

print(randomMat(123738, 9276, 735111))
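The jump-ahead property is not shown above; here is a sketch of how it could look (an illustrative addition using plain Python integers and an explicit 2^64 mask, based on the standard square-and-multiply composition of the affine LCG step):

MASK = 2**64 - 1  # all arithmetic is mod 2^64

def lcg_skip(state, k):
    """
    Advance the Knuth MMIX LCG by k steps in O(log2(k)) operations,
    by composing the affine map x -> A*x + C with itself.
    """
    a, c = 6364136223846793005, 1442695040888963407
    acc_mult, acc_plus = 1, 0   # identity transform accumulated so far
    cur_mult, cur_plus = a, c   # transform corresponding to 2^bit steps
    while k > 0:
        if k & 1:
            acc_mult = (acc_mult * cur_mult) & MASK
            acc_plus = (acc_plus * cur_mult + cur_plus) & MASK
        cur_plus = ((cur_mult + 1) * cur_plus) & MASK
        cur_mult = (cur_mult * cur_mult) & MASK
        k >>= 1
    return (acc_mult * state + acc_plus) & MASK

# sanity check: skipping 3 steps equals stepping the LCG three times
s = 12345
stepped = s
for _ in range(3):
    stepped = (6364136223846793005 * stepped + 1442695040888963407) & MASK
assert lcg_skip(s, 3) == stepped

This is what you would use if you preferred to treat the whole matrix as one long LCG sequence rather than re-seeding per element.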
Severin Pappadeux
  • Two questions. 1) Is there no issue about seeding with the matrix element index, in terms of correlations between the generated random numbers? 2) Jump-ahead complexity means the ability to produce the Nth number in logarithmic time? Would it be better to just use that property and consider the whole thing one sequence? – The Wind-Up Bird Dec 07 '17 at 14:52
  • @TheWind-UpBird 2. Jump-ahead complexity means that, given the current seed (say, position #0), you can skip the generation of N numbers, instead of spending O(N) time (basically, calling the RNG N times), using an O(log2(N)) algorithm and arriving at position #N. It is a seed-independent algorithm. PCG random has better randomness properties but still retains this feature inherited from the LCG. `Would it be better to just use that property and consider the whole thing one sequence?` Sure, up to you, but basically you're slowing yourself down a bit - instead of a single RNG call (O(1)) you now have an O(log2(N)) call. But it should work, I believe. – Severin Pappadeux Dec 07 '17 at 15:53
  • @TheWind-UpBird 1. There is a correlation, no question about it. But there is a correlation for each and every algorithm in existence - after all, for each index there is some predetermined and fixed set of instructions to be executed, that's it. So the only question is how much correlation you are able to tolerate. There is a further development of the LCG - PCG random (pcg-random.org), which is a hardened LCG that retains the log2(N) jump-ahead feature. Python code: https://github.com/rkern/pcg-python – Severin Pappadeux Dec 07 '17 at 15:55