
I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. At first I used this code:

N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])  

But sadly, it does not seem to terminate for such a big N. So I tried to split it up by generating the rows one at a time:

N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if i % 100 == 0:
        print(i)

Now something strange happens (at least on my device): the first ~1100 rows are generated very quickly, but after that the code becomes horribly slow. Why is this happening? What am I missing here? Are there better ways to create a big matrix which has True entries with probability p and False entries with probability 1-p?

Edit: Many of you assumed that RAM would be a problem. Since the device which will run the code has almost 500 GB of RAM, this won't be an issue.

zimmerrol
  • Failed to understand that part - `does not seem to terminate for this big N`. Clarifications on it? – Divakar Apr 20 '17 at 19:53
  • Why a boolean array without setting the dtype? And check if your memory is enough; otherwise thrashing will slow down every approach. – sascha Apr 20 '17 at 19:53
  • Once memory is exhausted, it slows down the machine? – Serge Apr 20 '17 at 19:53
  • @Serge But why? I mean: I am creating the NxN array in line 3 - so there is no reason why the memory should get exhausted. Furthermore, the memory is really no problem, as there is plenty in the machine - roughly 0.5 TB. – zimmerrol Apr 20 '17 at 19:54
  • check http://stackoverflow.com/questions/1053928/very-large-matrices-using-python-and-numpy – Serge Apr 20 '17 at 19:55
  • @Divakar "does not seem to terminate for this big N" means that I was not able to get a result after waiting a really long time. – zimmerrol Apr 20 '17 at 19:55
  • How much RAM do you have? 900 million elements is a lot, and NumPy creates an intermediate array of 900 million float64s to decide what elements to pick. Peak RAM usage is going to be over 8 GB. – user2357112 Apr 20 '17 at 19:56
  • I thought you slow down on the 1100th row, not the 3rd – Serge Apr 20 '17 at 19:56
  • @Serge Yes, roughly after the 1100th row has been generated. I meant that I am reserving the memory for the array in line (I wrote 'row' before by accident) 3. – zimmerrol Apr 20 '17 at 19:58
  • @FlashTek: Your OS won't actually commit RAM to that allocation until you write to it. – user2357112 Apr 20 '17 at 19:59
  • @user2357112 Okay, my bad. But when I change `np.empty` to `np.zeros`, the RAM will be allocated, won't it? After making this change, the slowdown happens at row 23000. – zimmerrol Apr 20 '17 at 20:02
  • Not sure, not an expert. I guess creating an empty table is faster than one filled with values. Anyway, I suggest reading my link. – Serge Apr 20 '17 at 20:02
  • if you have 500GB of RAM how is your first method not being finished? I am confused... – gold_cy Apr 20 '17 at 20:21
  • This is slightly faster though it won't solve if there is any memory problems: `np.random.binomial(1, 1-p, size=(N, N)).astype('bool')` – ayhan Apr 20 '17 at 20:28
  • @ayhan that's actually the slowest method of the 3 --> `CPU times: user 21 s, sys: 1.78 s, total: 22.8 s Wall time: 22.8 s` – gold_cy Apr 20 '17 at 20:32
  • @DmitryPolonskiy I don't have enough RAM to test it with N=30000 but for N=15000 it is faster http://imgur.com/fQUbYCa – ayhan Apr 20 '17 at 20:38
  • just to be sure, OS/numpy is 64 bit? – Serge Apr 20 '17 at 20:39
  • Another trick is to flush the cache (collect the garbage in the middle) http://stackoverflow.com/questions/14351255/techniques-for-working-with-large-numpy-arrays – Serge Apr 20 '17 at 20:53
  • Can you check the memory usage before and during execution, just to verify that the system has enough free memory? Can you share the result of `mask.dtype`, `mask.nbytes` and `mask[0][[0]].nbytes`? – shanmuga Apr 20 '17 at 21:05
  • @DmitryPolonskiy Yeah... this is confusing me as well. The code is slowing down dramatically, so that it does not finish in time, as already described. – zimmerrol Apr 21 '17 at 05:44
  • @Serge Yes, it's a 64-bit setup. – zimmerrol Apr 21 '17 at 05:44

5 Answers


The problem is your RAM: the values are stored in memory as the matrix is created. I just created this matrix using this command:

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

I used an AWS i3 instance with 64GB of RAM and 8 cores. To create this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:

time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s


def mask_method(N, p):
    mask = np.empty((N, N))
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)

time mask_method(N,p)

CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s

Note that the mask method only takes up ~9GB of RAM at its peak.

Edit: The first method flushes the RAM after the process is done, whereas the function method retains all of it.
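For a rough sense of where that memory goes, the footprints can be estimated directly (a sketch; the float64 intermediate is what `np.random.choice` draws before converting the result to booleans):

```python
N = 30000

# np.random.choice first draws a float64 per cell, then converts it,
# so peak usage is roughly the intermediate plus the final array.
intermediate_gb = N * N * 8 / 1e9  # float64 intermediate: ~7.2 GB
result_gb = N * N * 1 / 1e9        # final bool array: ~0.9 GB

print(intermediate_gb, result_gb)
```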

gold_cy

So I tried to split it up by generating the rows one at a time:

The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!

Since your data is boolean, you can get a factor of two speedup with

np.random.rand(N, N) > p

which, naturally, you could also use inside your row-by-row loop.
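A sketch of that combination (using a smaller N here for illustration; the key point is the `bool` dtype, so no full-matrix float64 copy is ever held at once):

```python
import numpy as np

N = 2000   # the question uses 30000; smaller here for illustration
p = 0.1

mask = np.empty((N, N), dtype=bool)
for i in range(N):
    # One row of float64s at a time, immediately reduced to booleans
    mask[i] = np.random.rand(N) > p
```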

It seems like np.random.choice could do with some buffering here - you might want to file an issue against numpy.

Another option would be to try and generate float32s instead of float64s. I'm not sure if numpy can do that right now, but you could request the feature.

Eric
  • Okay, interesting - the solution with `np.random.rand(N, N) > p` was my first idea, which I discarded as I thought that the direct numpy call would be faster. – zimmerrol Apr 21 '17 at 05:42
  • @FlashTek: The problem is that `np.random.choice` has to do more work, since it has to handle cases with more than two outcomes. There's definitely scope for a special case when the number of choices is two. – Eric Apr 21 '17 at 08:57
  • Alright. But do you know why this kind of slowdown happens in my first post? – zimmerrol Apr 22 '17 at 14:53
  • Which kind of slowdown? The reason your second attempt is faster is because you're not allocating all the floats at once, and they're much bigger than your final result. – Eric Apr 22 '17 at 16:56
  • No, I mean the slowdown in the second attempt which happens ~ after row 1100 has been generated - as described above. – zimmerrol Jun 18 '17 at 08:55

I'm really surprised no one has mentioned this solution yet.

This line

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

runs N×N Bernoulli trials (in your case, 900M of them!). A Bernoulli trial is just a random experiment with two possible outcomes, with probabilities p and 1-p.

The sum of N Bernoulli trials, each with probability p, can be modeled by the Binomial distribution.

We can leverage this fact to randomly simulate the total count of True elements. With NumPy,

import numpy as np

N = 30000
p = 0.1

# Build a random number generator
rng = np.random.default_rng(123)

# Randomly determine the total number of True values
Ntrue = rng.binomial(n=N*N, p=p, size=1)[0]  # 90016776

Now we can randomly determine the position of each True element by randomly choosing row and col indices without replacement.

# Randomly determine true position
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N,N))

And now we can populate a compressed sparse row (CSR) matrix.

from scipy import sparse

# Build a compressed sparse row matrix with the constructor:
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
result = sparse.csr_matrix((np.ones(shape=Ntrue), positions), shape=(N,N))

Notice this solution avoids storing and computing 900M boolean values.
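Putting the pieces above together (with a smaller N for illustration), a quick sanity check on the resulting density, and pulling a single dense row on demand, might look like this sketch:

```python
import numpy as np
from scipy import sparse

N = 3000   # the question uses 30000; smaller here for illustration
p = 0.1
rng = np.random.default_rng(123)

# Draw the total count of True cells, then place them without replacement
Ntrue = rng.binomial(n=N*N, p=p)
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N, N))

result = sparse.csr_matrix((np.ones(Ntrue, dtype=bool), positions), shape=(N, N))

# The observed density should be close to p
print(result.nnz / (N * N))

# A single dense boolean row on demand, without materializing the full matrix
row0 = result[0].toarray().ravel()
```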

Funny enough, I wrote about a nearly identical problem before stumbling upon this question.

Ben

Another possibility is to generate the matrix in batches, i.e. compute many sub-arrays and stack them together at the very end. But avoid updating one big array (`mask`) in a for loop as the OP is doing: that forces the whole array to be held in main memory during every indexing update.

Instead, for example: to get 30000x30000, keep 90,000 separate 100x100 arrays (a 300x300 grid of blocks), fill each of these 100x100 arrays accordingly in a loop, and finally stack them together into one giant array. This would definitely need no more than 4GB of RAM and would be very fast as well.

Minimal Example:

In [9]: a
Out[9]: 
array([[0, 1],
       [2, 3]])

In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]: 
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
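Applied to the random-mask problem, the same stacking idea might look like this sketch (smaller N for illustration): generate each batch of rows as booleans, so only one batch of float64s exists at a time, then stack once at the end.

```python
import numpy as np

N = 3000      # the question uses 30000; smaller here for illustration
p = 0.1
rows = 500    # rows per batch; assumed to divide N evenly

# Each batch is reduced to booleans immediately, then stacked once
batches = [np.random.rand(rows, N) > p for _ in range(N // rows)]
mask = np.vstack(batches)
```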
kmario23

You can use a random number generator for this, like:

rng = np.random.default_rng()

random_bools = rng.integers(0, 1, (4, 3), endpoint=True).astype('bool')

It will give you a random boolean array of size (4, 3), which you can adjust according to your needs.
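Note that `integers(0, 1, ..., endpoint=True)` yields True and False with equal probability. For a general probability `p` of True, the same generator's `random` method can be compared against `p` (a sketch):

```python
import numpy as np

rng = np.random.default_rng()

# Equal-probability booleans, as in the answer above
random_bools = rng.integers(0, 1, (4, 3), endpoint=True).astype(bool)

# True with probability p, False with probability 1-p
p = 0.1
biased_bools = rng.random((4, 3)) < p
```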

Kasif
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 13 '23 at 20:56