6

I'm looking to generate the cartesian product of a relatively large number of arrays to span a high-dimensional grid. Because of the high dimensionality, it won't be possible to store the result of the cartesian product computation in memory; rather it will be written to hard disk. Because of this constraint, I need access to the intermediate results as they are generated. What I've been doing so far is this:

for x in xrange(0, 10):
    for y in xrange(0, 10):
        for z in xrange(0, 10):
            writeToHdd(x,y,z)

which, apart from being very nasty, is not scalable (i.e. it would require me writing as many loops as dimensions). I have tried to use the solution proposed here, but that is a recursive solution, which therefore makes it quite hard to obtain the results on the fly as they are being generated. Is there any 'neat' way to do this other than having a hardcoded loop per dimension?

Community
  • 1
  • 1
danielvdende
  • 690
  • 9
  • 23

3 Answers3

4

In plain Python, you can generate the Cartesian product of a collection of iterables using itertools.product.

>>> arrays = range(0, 2), range(4, 6), range(8, 10)
>>> list(itertools.product(*arrays))
[(0, 4, 8), (0, 4, 9), (0, 5, 8), (0, 5, 9), (1, 4, 8), (1, 4, 9), (1, 5, 8), (1, 5, 9)]

In Numpy, you can combine numpy.meshgrid (passing sparse=True to avoid expanding the product in memory) with numpy.ndindex:

>>> arrays = np.arange(0, 2), np.arange(4, 6), np.arange(8, 10)
>>> grid = np.meshgrid(*arrays, sparse=True)
>>> [tuple(g[i] for g in grid) for i in np.ndindex(grid[0].shape)]
[(0, 4, 8), (0, 4, 9), (1, 4, 8), (1, 4, 9), (0, 5, 8), (0, 5, 9), (1, 5, 8), (1, 5, 9)]
Gareth Rees
  • 64,967
  • 9
  • 133
  • 163
1

I think I figured out a nice way using a memory mapped file:

def carthesian_product_mmap(vectors, filename, mode='w+'):
    '''
    Vectors should be a tuple of `numpy.ndarray` vectors. You could
    also make it more flexible, and include some error checking
    '''        
    # Make a meshgrid with `copy=False` to create views
    grids = np.meshgrid(*vectors, copy=False, indexing='ij')

    # The shape for concatenating the grids from meshgrid
    shape = grid[0].shape + (len(vectors),)

    # Find the "highest" dtype neccesary
    dtype = np.result_type(*vectors)

    # Instantiate the memory mapped file
    M = np.memmap(filename, dtype, mode, shape=shape)

    # Fill the memmap with the grids
    for i, grid in enumerate(grids):
        M[...,i] = grid

    # Make sure the data is written to disk (optional?)
    M.flush()

    # Reshape to put it in the right format for Carthesian product
    return M.reshape((-1, len(vectors)))

But I wonder if you really need to store the whole Carthesian product (there's a lot of data duplication). Is it not an option to generate the rows in the product at the moment they're needed?

  • You could also do assignment by broadcasting, see http://stackoverflow.com/a/11146645/2379410 –  Jul 09 '14 at 13:41
  • Just out of curiosity; would Numpy memmap outperform a database-based solution? I guess a database would have some overhead to get the connection set up etc, but I would think that databases provide 'smart' indexing/compression systems etc? – danielvdende Jul 14 '14 at 07:50
  • @danielvdende, I don't have experience with databases.. Maybe when the disk IO is the bottleneck it could be equally fast as Numpy –  Jul 15 '14 at 12:49
0

It seems you just want to loop over an arbitrary number of dimensions. My generic solution for this is using an index field and increment indices plus handling overflows.

Example:

n = 3 # number of dimensions
N = 1 # highest index value per dimension

idx = [0]*n
while True:
    print(idx)
    # increase first dimension
    idx[0] += 1
    # handle overflows
    for i in range(0, n-1):
        if idx[i] > N:
            # reset this dimension and increase next higher dimension
            idx[i] = 0
            idx[i+1] += 1
    if idx[-1] > N:
        # overflow in the last dimension, we are finished
        break

Gives:

[0, 0, 0]
[1, 0, 0]
[0, 1, 0]
[1, 1, 0]
[0, 0, 1]
[1, 0, 1]
[0, 1, 1]
[1, 1, 1]

Numpy has something similar inbuilt: ndenumerate.

NoDataDumpNoContribution
  • 10,591
  • 9
  • 64
  • 104