
I know about itertools.product for iterating over the combinations of several lists of keywords. For instance, if I have this:

categories = [
    [ 'A', 'B', 'C', 'D'],
    [ 'E', 'F', 'G', 'H'],
    [ 'I', 'J', 'K', 'L']
]

and I use itertools.product() on it, I get something like:

>>> [x for x in itertools.product(*categories)]
[('A', 'E', 'I'),
 ('A', 'E', 'J'),
 ('A', 'E', 'K'),
 ('A', 'E', 'L'),
 ('A', 'F', 'I'),
 ('A', 'F', 'J'),
 ...]

Is there an equivalent, straightforward way of doing the same thing with numpy's arrays?

Jivan

1 Answer


This question has been asked a couple of times already:

Using numpy to build an array of all combinations of two arrays

itertools product speed up

The first link has a working numpy solution that is claimed to be several times faster than itertools, though no benchmarks are provided there. The code was written by a user named pv. Please follow the link and upvote his answer if you find it useful:

import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # integer division: '/' would give a float in Python 3
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in range(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out
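For example, applied to the categories from the question, it returns the same tuples as itertools.product, just as rows of a single array (note that the output dtype is taken from the first input array, so mixed-type inputs need care):

>>> cartesian(categories)[:5]
array([['A', 'E', 'I'],
       ['A', 'E', 'J'],
       ['A', 'E', 'K'],
       ['A', 'E', 'L'],
       ['A', 'F', 'I']], dtype='<U1')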

Nevertheless, in the same post Alex Martelli, a renowned Python expert on SO, wrote that itertools was the fastest way to get this task done. So here is a quick benchmark that backs up Alex's words.

import numpy as np
import time
import itertools


def cartesian(arrays, out=None):
    ...  # the function defined above


def test_numpy(arrays):
    for res in cartesian(arrays):
        pass


def test_itertools(arrays):
    for res in itertools.product(*arrays):
        pass


def main():
    arrays = [np.fromiter(range(100), dtype=int), np.fromiter(range(100, 200), dtype=int)]
    start = time.perf_counter()
    for _ in range(100):
        test_numpy(arrays)
    print(time.perf_counter() - start)
    start = time.perf_counter()
    for _ in range(100):
        test_itertools(arrays)
    print(time.perf_counter() - start)

if __name__ == '__main__':
    main()

Output:

0.421036
0.06742

So if you are going to iterate over the product, you should definitely use itertools.

Eli Korvigo
  • Thanks for the extended answer and the advice that came with it. – Jivan Feb 23 '15 at 22:43
  • The speed difference is because you are iterating over the cartesian() result, and iteration over numpy arrays is slower than over Python iterators. If you only want to construct the array, you need to compare `cartesian(...)` to `np.array(list(itertools.product(...)))`. For iteration itertools is the correct answer, but the question here is asking about construction. – pv. Feb 24 '15 at 09:42
  • @Jivan As pv. has pointed out, his numpy function will be faster at building a numpy array, due to the significant overhead of converting a Python iterator (produced by `itertools.product`) into a numpy array: numpy arrays of objects (tuples in this case) can't be created directly from iterators. In my tests it's ~5 times faster, but remember that iterating over a numpy array is significantly slower (over 5 times slower according to my test posted above), so you should use iterators if iteration speed is your main concern. – Eli Korvigo Feb 24 '15 at 10:48
  • The answer is wrong and likely misleading. It is much faster to use numpy.meshgrid *if you are starting from and ending up with numpy arrays*; at that point it mostly comes down to RAM I/O. For me, meshgrid with ravel/concatenate is orders of magnitude faster without much optimization, while np.array(list(itertools.product(x, y))) is not fast (see the sketch below). – mathtick Jul 09 '21 at 18:18
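For reference, here is a minimal sketch of the meshgrid-based construction mentioned in the last comment. The helper name build_product is mine, not from any of the posts above; with indexing='ij' the rows come out in the same order as itertools.product and cartesian():

import itertools

import numpy as np

def build_product(arrays):
    # Stack the coordinate grids from np.meshgrid and flatten them into
    # one (M, len(arrays)) array. indexing='ij' makes the first array the
    # slowest-varying axis, matching itertools.product's row order.
    arrays = [np.asarray(a) for a in arrays]
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

x = np.arange(100)
y = np.arange(100, 200)

# Both build the same (10000, 2) array; the meshgrid version avoids
# round-tripping through a Python-level list of tuples.
assert np.array_equal(build_product([x, y]),
                      np.array(list(itertools.product(x, y))))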