
I am using a list of lists with varying sizes. For example, alternativesList can contain 4 lists in one iteration and 7 lists in another.

What I am trying to do is capture every combination of words across the different lists.

Let's say that

import itertools

alternativesList = []
a = [1, 2, 3]
alternativesList.append(a)
b = ["a", "b", "c"]
alternativesList.append(b)

productList = itertools.product(*alternativesList)

will create

[(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'), (3, 'a'), (3, 'b'), (3, 'c')]

One problem here is that productList can be so large that it causes memory problems. So I am keeping productList as an iterator object and iterating over it later.
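
A minimal sketch of what I mean by iterating lazily (the loop body is just a placeholder):

import itertools

alternativesList = [[1, 2, 3], ["a", "b", "c"]]
productList = itertools.product(*alternativesList)  # lazy iterator; nothing is stored yet

for combination in productList:
    # the real per-combination work would go here
    print(combination)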

What I want to know is: is there a way to create the same object with numpy that works faster than itertools?

rkd
  • `itertools.product` returns an iterator that generates only one tuple at a time. You can't possibly have memory problems with that unless you do something silly like calling `list` on it (see the small demonstration after these comments). You won't find a more memory-efficient solution. – Aran-Fey Mar 25 '18 at 12:03
  • What I am trying to say is that when I call it as a list, it causes memory problems; therefore I am not calling it as a list. The real question is: is there a way to do it faster with numpy without having memory problems as well? – rkd Mar 25 '18 at 12:10
  • Possible duplicate of: https://stackoverflow.com/questions/4709510/itertools-product-speed-up – game0ver Mar 25 '18 at 12:12
  • This might help you: https://stackoverflow.com/questions/11144513/numpy-cartesian-product-of-x-and-y-array-points-into-single-array-of-2d-points – readyready15728 Mar 25 '18 at 12:39
  • You'll want to provide the full code which triggers memory problems. As Aran-Fey said, with the current code sample you shouldn't have any issues as long as you don't do something silly. – Brad Koch Mar 25 '18 at 12:52
  • An array is somewhat more compact than the equivalent nested list, but not drastically so - 2 or 3x, not 10. And with an array you have the problem of mixing integers and strings - that's possible, not simple. – hpaulj Mar 25 '18 at 16:58
  • Possible duplicate of [itertools product speed up](https://stackoverflow.com/questions/4709510/itertools-product-speed-up) – klutt Mar 25 '18 at 18:31
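
To illustrate the point about laziness, a small demonstration (a sketch, not from the thread): the product object itself stays tiny no matter how many combinations it represents, because the tuples are generated on demand.

import itertools
import sys

p = itertools.product(range(1000), range(1000), range(1000))
print(sys.getsizeof(p))  # shallow size of the iterator: tiny, despite 10**9 combinations
print(next(p))           # (0, 0, 0) -- combinations are produced one at a time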

2 Answers


You can avoid some of the problems arising from numpy trying to find a catch-all dtype by explicitly specifying a compound dtype:

Code + some timings:

import numpy as np
import itertools

def cartesian_product_mixed_type(*arrays):
    arrays = *map(np.asanyarray, arrays),
    # compound dtype with one field per input, so mixed types
    # (ints, strings, floats) coexist without coercion to a catch-all dtype
    dtype = np.dtype([(f'f{i}', a.dtype) for i, a in enumerate(arrays)])
    # one output axis per input array; ravel at the end flattens the grid
    out = np.empty((*map(len, arrays),), dtype)
    # index template used to broadcast each input along its own axis
    idx = slice(None), *itertools.repeat(None, len(arrays) - 1)
    for i, a in enumerate(arrays):
        out[f'f{i}'] = a[idx[:len(arrays) - i]]
    return out.ravel()

a = np.arange(4)
b = np.arange(*map(ord, ('A', 'D')), dtype=np.int32).view('U1')
c = np.arange(2.)

np.set_printoptions(threshold=10)

print(f'a={a}')
print(f'b={b}')
print(f'c={c}')

print('itertools')
print(list(itertools.product(a,b,c)))
print('numpy')
print(cartesian_product_mixed_type(a,b,c))

a = np.arange(100)
b = np.arange(*map(ord, ('A', 'z')), dtype=np.int32).view('U1')
c = np.arange(20.)

import timeit
kwds = dict(globals=globals(), number=1000)

print()
print(f'a={a}')
print(f'b={b}')
print(f'c={c}')

print(f"itertools: {timeit.timeit('list(itertools.product(a,b,c))', **kwds):7.4f} ms")
print(f"numpy:     {timeit.timeit('cartesian_product_mixed_type(a,b,c)', **kwds):7.4f} ms")

a = np.arange(1000)
b = np.arange(1000, dtype=np.int32).view('U1')

print()
print(f'a={a}')
print(f'b={b}')

print(f"itertools: {timeit.timeit('list(itertools.product(a,b))', **kwds):7.4f} ms")
print(f"numpy:     {timeit.timeit('cartesian_product_mixed_type(a,b)', **kwds):7.4f} ms")

Sample output:

a=[0 1 2 3]
b=['A' 'B' 'C']
c=[0. 1.]
itertools
[(0, 'A', 0.0), (0, 'A', 1.0), (0, 'B', 0.0), (0, 'B', 1.0), (0, 'C', 0.0), (0, 'C', 1.0), (1, 'A', 0.0), (1, 'A', 1.0), (1, 'B', 0.0), (1, 'B', 1.0), (1, 'C', 0.0), (1, 'C', 1.0), (2, 'A', 0.0), (2, 'A', 1.0), (2, 'B', 0.0), (2, 'B', 1.0), (2, 'C', 0.0), (2, 'C', 1.0), (3, 'A', 0.0), (3, 'A', 1.0), (3, 'B', 0.0), (3, 'B', 1.0), (3, 'C', 0.0), (3, 'C', 1.0)]
numpy
[(0, 'A', 0.) (0, 'A', 1.) (0, 'B', 0.) ... (3, 'B', 1.) (3, 'C', 0.)
 (3, 'C', 1.)]

a=[ 0  1  2 ... 97 98 99]
b=['A' 'B' 'C' ... 'w' 'x' 'y']
c=[ 0.  1.  2. ... 17. 18. 19.]
itertools:  7.4339 ms
numpy:      1.5701 ms

a=[  0   1   2 ... 997 998 999]
b=['' '\x01' '\x02' ... 'ϥ' 'Ϧ' 'ϧ']
itertools: 62.6357 ms
numpy:      8.0249 ms
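
Since the function returns a structured array, individual columns can be pulled back out by field name; a small usage sketch:

out = cartesian_product_mixed_type(np.arange(4), np.arange(2.))
print(out['f0'])  # first column:  [0 0 1 1 2 2 3 3]
print(out['f1'])  # second column: [0. 1. 0. 1. 0. 1. 0. 1.]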
Paul Panzer
  • Fantastic. It does exactly what I want to do. However, I believe it first puts every possible combination into an array and then returns it. This will probably cause memory problems for me. – rkd Mar 25 '18 at 15:49
  • @rkd It stores everything in one large array, yes. What you gain by that is a factor of ~6 in the last example: a list of pairs takes 72 bytes per pair - 8 for the pointer in the list to the tuple, 48 for the tuple and 16 for the two pointers to the tuple elements. The array stores everything directly, so it takes only 12 bytes per pair - 8 for the int (on Linux; on Windows it might even be only 4, not sure, though) and 4 for the unicode character. If that's still too large, you could try and chunk it (a sketch of one way follows these comments) - failing that, you are, as far as I can tell, stuck with the itertools / generator approach. – Paul Panzer Mar 25 '18 at 18:32
  • @PaulPanzer Wow! Your answer taught me a few things about python programming and numpy. – MSS Jul 28 '23 at 14:25
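
A sketch of the chunked approach suggested above (product_in_chunks is a hypothetical helper, not a library function):

import itertools

def product_in_chunks(arrays, chunk_size=100000):
    # yield the cartesian product in fixed-size batches, so only one
    # batch is held in memory at any time
    it = itertools.product(*arrays)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

for batch in product_in_chunks([[1, 2, 3], ["a", "b", "c"]], chunk_size=4):
    print(batch)  # process the batch, then let it be garbage collected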

Generally speaking, if we picture optimization as a balance scale, memory and runtime are its two weighing pans. That is to say, memory optimization and runtime optimization usually trade off against each other (not always, but most of the time). Now, regarding your question:

Is there a way to create the same object with numpy which works faster than itertools?

There definitely is, but another point you need to notice is that abstraction gives you much more flexibility, which is what itertools.product offers and Numpy doesn't. If scalability is not an important factor in this case, you can do this with Numpy without giving up any of the benefits. Here is one way, using the column_stack, repeat and tile functions:

In [3]: a = np.array([1, 2, 3])

In [4]: b = np.array(["a", "b", "c"])

In [5]: np.column_stack((np.repeat(a, b.size), np.tile(b, a.size)))
Out[5]: 
array([['1', 'a'],
       ['1', 'b'],
       ['1', 'c'],
       ['2', 'a'],
       ['2', 'b'],
       ['2', 'c'],
       ['3', 'a'],
       ['3', 'b'],
       ['3', 'c']], dtype='<U21')

Still, there are ways to make this array occupy less memory, by using lighter types like U2, U1, etc.:

In [10]: np.column_stack((np.repeat(a, b.size),np.tile(b, a.size))).astype('U1')
Out[10]: 
array([['1', 'a'],
       ['1', 'b'],
       ['1', 'c'],
       ['2', 'a'],
       ['2', 'b'],
       ['2', 'c'],
       ['3', 'a'],
       ['3', 'b'],
       ['3', 'c']], dtype='<U1') 
Mazdak
  • Unfortunately, scalability is a huge factor for me. After your answer I did some research and found out that this would work for me: numpy.array(numpy.meshgrid(*alternativesList)).T.reshape(-1,len(alternativesList)). But for some reason it works slower, and it also looks like it is going to create a memory problem (a runnable version of this snippet follows below). – rkd Mar 25 '18 at 13:23
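
Written out as a runnable snippet, the meshgrid approach from the comment above might look like this (a sketch, using the two lists from the question; note that it builds the full array in memory):

import numpy as np

alternativesList = [np.array([1, 2, 3]), np.array(["a", "b", "c"])]
combos = np.array(np.meshgrid(*alternativesList)).T.reshape(-1, len(alternativesList))
print(combos)  # 9 rows, one per combination; mixed types are coerced to strings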