
I want to build a numpy array from the result of itertools.product. My first approach was simple:

from itertools import product
import numpy as np

max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12

result = np.array(list(product(init_values, repeat=repetitions)))

This code works well for "small" repetitions (like <= 4), but with "large" values (>= 12) it completely hogs the memory and crashes. I assumed that building the list was the thing eating all the RAM, so I searched for ways to build the array directly. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.

So, I tested the following alternatives:

Alternative #1:

results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
    results[i] = row

Alternative #2:

init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)

Alternative #3:

results = np.indices([max_init] * repetitions).reshape(repetitions, -1).T + 1

#1 is extremely slow: I didn't have the patience to let it finish (I gave up after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the Python interpreter crashes, just like the initial approach.
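A rough back-of-the-envelope estimate (assuming numpy's default int64 dtype on a 64-bit machine) shows why the full array cannot fit in memory:

rows = 6 ** 12            # 2,176,782,336 combinations
cols = 12
bytes_per_item = 8        # numpy's default int64
print(rows * cols * bytes_per_item / 1e9)  # ~209 GB for the full array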

After that, I thought that I could express the same information in a different way that would still be useful for me: a dict where the keys would be all the possible (sorted) combinations, and the values would be the counts of those combinations. So I tried:

Alternative #4:

from collections import Counter

def sorted_product(iterable, repeat=1):
    for el in product(iterable, repeat=repeat):
        yield tuple(sorted(el))

def count_product(repetitions=1, max_init=6):
    init_values = range(1, max_init + 1)
    sp = sorted_product(init_values, repeat=repetitions)
    counted_sp = Counter(sp)
    return np.array(list(counted_sp.values())), \
        np.array(list(counted_sp.keys()))

cnt, values = count_product(repetitions=repetitions, max_init=max_init)

But the line counted_sp = Counter(sp), which consumes the whole generator, is also too slow for my needs (it also took several minutes before I canceled it).

Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?

PS: I tested all the implementations above against my tests with a small number of repetitions, and all the tests passed, so they give consistent results.


I hope that editing the question is the best way to expand it. Otherwise, let me know, and I'll post the edit where it belongs.

After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle. Instead of going with a "brute force" approach, I should use probabilities and work with those.

My intention is, later on, for each combination:

- Count how many values are under a threshold X.
- Count how many values are equal to or over threshold X and below a threshold Y.
- Count how many values are over threshold Y.

Then group the combinations that have the same counts.

As an illustrative example: If I roll 12 dice of 6 sides, what's the probability of having M dice with a value <3, N dice with a value >=3 and <4, and P dice with a value >5, for all possible combinations of M, N, and P?
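A rough sketch of where I am heading with the probability approach (the thresholds X = 3 and Y = 5 and the three-bucket split below are illustrative assumptions, not exactly the ranges above): the counts (M, N, P) follow a multinomial distribution, so each probability can be computed from the per-die bucket probabilities without enumerating the 6**12 outcomes.

from math import comb

def bucket_probabilities(sides=6, X=3, Y=5):
    # Per-die probability of each bucket: value < X, X <= value < Y, value >= Y.
    faces = range(1, sides + 1)
    p_low = sum(1 for f in faces if f < X) / sides
    p_mid = sum(1 for f in faces if X <= f < Y) / sides
    return p_low, p_mid, 1 - p_low - p_mid

def multinomial_pmf(counts, probs):
    # Probability of seeing exactly counts[i] dice in bucket i (independent dice).
    coeff, remaining = 1, sum(counts)
    for c in counts:
        coeff *= comb(remaining, c)
        remaining -= c
    p = coeff
    for c, prob in zip(counts, probs):
        p *= prob ** c
    return p

probs = bucket_probabilities()
n_dice = 12
table = {(m, n, n_dice - m - n): multinomial_pmf((m, n, n_dice - m - n), probs)
         for m in range(n_dice + 1) for n in range(n_dice + 1 - m)}
print(sum(table.values()))  # ~1.0, sanity check

For 12 dice that is at most 91 (M, N, P) triples to evaluate instead of 6**12 rolls.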

So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!

Sergi

2 Answers


The number of tuples that list(product(range(1,7), repeat=12)) makes is 6**12 = 2,176,782,336. Whether as a list or an array, that's probably too large for most computers.

In [119]: len(list(product(range(1,7),repeat=12)))
....
MemoryError:

Trying to make an array of that size directly:

In [129]: A = np.ones((6**12,12),int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-129-e833a9e859e0> in <module>()
----> 1 A = np.ones((6**12,12),int)

/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
    190 
    191     """
--> 192     a = empty(shape, dtype, order)
    193     multiarray.copyto(a, 1, casting='unsafe')
    194     return a

ValueError: Maximum allowed dimension exceeded

Array memory size, at 4 bytes per item:

In [130]: 4*12*6**12
Out[130]: 104485552128

100GB?

Why do you need to generate 2 billion combinations of 12 numbers?


So with your Counter you reduce the number of items:

In [138]: sp = sorted_product(range(1,7), 2)
In [139]: counter=Counter(sp)
In [140]: counter
Out[140]: 
Counter({(1, 1): 1,
         (1, 2): 2,
         (1, 3): 2,
         (1, 4): 2,
         (1, 5): 2,
         (1, 6): 2,
         (2, 2): 1,
         (2, 3): 2,
         (2, 4): 2,
         (2, 5): 2,
         (2, 6): 2,
         (3, 3): 1,
         (3, 4): 2,
         (3, 5): 2,
         (3, 6): 2,
         (4, 4): 1,
         (4, 5): 2,
         (4, 6): 2,
         (5, 5): 1,
         (5, 6): 2,
         (6, 6): 1})

That's a reduction from 36 to 21 items (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (combinations? permutations?), though it will still push time and/or memory boundaries.
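A sketch of that generalization (assuming I haven't botched the combinatorics): the keys of such a Counter are exactly itertools.combinations_with_replacement, and each count is the multinomial coefficient of the value multiplicities, so the same dict can be built without consuming the full product:

from itertools import combinations_with_replacement
from collections import Counter
from math import factorial
import numpy as np

def count_product_direct(repetitions=1, max_init=6):
    # Same information as Counter(sorted_product(...)), built from
    # combinations_with_replacement instead of the full product.
    init_values = range(1, max_init + 1)
    keys, counts = [], []
    for combo in combinations_with_replacement(init_values, repetitions):
        perms = factorial(repetitions)
        for mult in Counter(combo).values():
            perms //= factorial(mult)      # multinomial coefficient
        keys.append(combo)
        counts.append(perms)
    return np.array(counts), np.array(keys)

For repetitions=12 and max_init=6 that is only 6188 keys (comb(17, 12)) instead of 6**12 tuples.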


A variant on meshgrid using mgrid:

In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
In [176]: A.shape
Out[176]: (279936, 7)
In [177]: B=np.array(list(product(range(1,7),repeat=7)))
In [178]: B.shape
Out[178]: (279936, 7)
In [179]: A[:10]
Out[179]: 
array([[1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 1, 1, 3],
       [1, 1, 1, 1, 1, 1, 4],
       [1, 1, 1, 1, 1, 1, 5],
       [1, 1, 1, 1, 1, 1, 6],
       [1, 1, 1, 1, 1, 2, 1],
       [1, 1, 1, 1, 1, 2, 2],
       [1, 1, 1, 1, 1, 2, 3],
       [1, 1, 1, 1, 1, 2, 4]])
In [180]: B[:10]
Out[180]: 
array([[1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 1, 1, 3],
       [1, 1, 1, 1, 1, 1, 4],
       [1, 1, 1, 1, 1, 1, 5],
       [1, 1, 1, 1, 1, 1, 6],
       [1, 1, 1, 1, 1, 2, 1],
       [1, 1, 1, 1, 1, 2, 2],
       [1, 1, 1, 1, 1, 2, 3],
       [1, 1, 1, 1, 1, 2, 4]])
In [181]: np.allclose(A,B)

mgrid is quite a bit faster:

In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

but, yes, it will have the same overall memory usage and limit.

With n=10,

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
hpaulj

The right answer is: Don't. Whatever you want to do with all these combinations, adjust your approach so that you either generate them one at a time and use them immediately without storing them, or better yet, find a way to get the job done without inspecting every combination. Your current solution works for toy problems, but is not suitable for larger parameters. Explain what you are up to and maybe someone here can help.
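As an illustration of the "use them immediately without storing them" option, here is a sketch (the threshold counts are just a stand-in for whatever statistic you actually need):

from itertools import product
from collections import Counter

def threshold_counts(repetitions=12, max_init=6, low=3, high=5):
    # Fold each tuple into an accumulator as it is produced;
    # memory use stays constant, nothing is stored per tuple.
    counts = Counter()
    for roll in product(range(1, max_init + 1), repeat=repetitions):
        m = sum(v < low for v in roll)
        p = sum(v >= high for v in roll)
        counts[(m, repetitions - m - p, p)] += 1
    return counts

This keeps memory flat but still visits all 6**12 tuples, so for repetitions=12 the "find a way to get the job done without inspecting every combination" part still applies.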

alexis