@Paul Panzer shared an excellent answer on how to compute the cartesian product of a list of NumPy arrays efficiently. I have modified his `cartesian_product_transpose_pp(arrays)` function to show that the iteration proceeds from the left column to the right column of the returned array.

import numpy as np
import itertools
import time

def cartesian_product_transpose_pp(arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty((la, *map(len, arrays)), dtype=dtype)
    idx = slice(None), *itertools.repeat(None, la)
    for i, a in enumerate(arrays):
        arr[i, ...] = a[idx[:i]]  # my modification
    return arr.reshape(la, -1).T

mumax = 18
mumin = 1
nsample = 8
mu_list = list(range(mumin, mumax + 1))
mu_array = np.array(mu_list, dtype=np.uint8)
mu_alist = [mu_array] * nsample

start = time.time()
cartesian_product_transpose_pp(mu_alist)
end = time.time()
print(f'\ncartesian_product_transpose_pp Time: {end - start}sec')

However, when this function's argument (i.e. `arrays`) exceeds a certain size, the `arr` it allocates becomes very large and the function fails with a `MemoryError`. Example:

arr = np.empty( ( la, *map(len, arrays) ), dtype=dtype )
MemoryError: Unable to allocate 82.1 GiB for an array with shape (8, 18, 18, 18, 18, 18, 18, 18, 18) and data type uint8
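The 82.1 GiB figure follows directly from the allocation's shape and dtype; a quick sanity check (plain arithmetic, no allocation):

```python
import numpy as np

# Shape of the failing allocation: (la, 18, 18, ..., 18) with
# nsample = 8 trailing axes, dtype uint8 (1 byte per element).
shape = (8,) + (18,) * 8
nbytes = np.prod(shape, dtype=np.int64) * np.dtype(np.uint8).itemsize
print(f'{nbytes / 2**30:.1f} GiB')  # 82.1 GiB, matching the error
```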

To address this memory error, I would like to break `arr` into smaller chunks, so that smaller chunks of `arr.reshape(la, -1).T` can be yielded one at a time. How do I do this when the value of `nsample` increases?
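For illustration, the behaviour I am after can be sketched with plain `itertools.product` (this `cartesian_product_chunks` helper is a hypothetical, much slower stand-in for the NumPy version; it only shows the desired chunked output, where each yielded block has at most `chunk_rows` rows and always `nsample` columns):

```python
import itertools
import numpy as np

def cartesian_product_chunks(arrays, chunk_rows):
    # Hypothetical sketch: stream the cartesian product block by block
    # instead of materialising the full (18**nsample, nsample) array.
    dtype = np.result_type(*arrays)
    rows = itertools.product(*arrays)
    while True:
        block = list(itertools.islice(rows, chunk_rows))
        if not block:
            return
        yield np.array(block, dtype=dtype)

# Usage: peak memory per block is bounded by chunk_rows * nsample bytes.
for block in cartesian_product_chunks([np.arange(3, dtype=np.uint8)] * 3, 10):
    print(block.shape)  # (10, 3), (10, 3), (7, 3)
```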

Updated test code that I am now using:

import numpy as np
import itertools
import time
import sys

def cartesian_product_transpose_pp(arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty((la, *map(len, arrays)), dtype=dtype)
    idx = slice(None), *itertools.repeat(None, la)
    for i, a in enumerate(arrays):
        arr[i, ...] = a[idx[:i]]
    return arr.reshape(la, -1).T

mumax = 18
mumin = 1
nsample = 9
mu_list = list(range(mumin, mumax + 1))
mu_array = np.array(mu_list, dtype=np.uint8)
mu_alist = [mu_array] * nsample

a = mu_alist
start = time.time()
c = 1
result = (
    cartesian_product_transpose_pp([*x[:, None], *a[c:]])
    for x in cartesian_product_transpose_pp(a[:c])
    )
with np.printoptions(threshold=sys.maxsize):
    for n, chunk in enumerate(result):
        #print(n, chunk)  # for debugging
        last = chunk  # don't rebind `a`: the generator reads a[c:] lazily
end = time.time()
print(f'\ncartesian_product_transpose_pp Time: {end - start}')

Error message:

    arr = np.empty((la, *map(len, arrays)), dtype=dtype)
MemoryError: Unable to allocate 92.4 GiB for an array with shape (9, 1, 18, 18, 18, 18, 18, 18, 18, 18) and data type uint8
Sun Bear
  • Interestingly, you use the keyword `yield` in your question. I believe `yield` will help you achieve what you want. Instead of returning the entire array, you'd be returning a generator. If you modify the function to return only a chunk at a time, you could effectively have a generator that yields one chunk of your total array per iteration. I'm not strong enough with `numpy` arrays to give sample code, so I hope this comment can help get you started! – gallen Jun 26 '20 at 22:50
  • @gallen I had thought so too. But, the `MemoryError` occurs at `arr = numpy.empty((la, *map(len, arrays)), dtype=dtype)` before the return or yield. Other than putting that in a `try except` statement to detect the `MemoryError`, what is the appropriate shape of `arr` that I should use? – Sun Bear Jun 26 '20 at 23:04
  • So I decided to experiment with this, and if you just remove 1 dimension from `*map(len, arrays)`, which would take it from length 8 to length 7, the memory error will not occur. I don't think that is the "answer", per se. But perhaps a starting point for you to explore. – gallen Jun 26 '20 at 23:43
  • 1
    @gallen I actually think it is - the answer that is. Only I would not completely remove the dimension but split it up into shorter (possibly single element) bits. That way the resulting parts of the product already have the right number of columns and will be easy to splice together. – Paul Panzer Jun 27 '20 at 15:47
  • @PaulPanzer @gallen I would like the smaller chunks of `return arr.reshape(la, -1).T` to have less number of rows while the number of columns equals `nsample` always. How do I do this while ensuring that the complete cartesian product is correctly passed into the different smaller chunks? – Sun Bear Jun 27 '20 at 22:14
  • 1
    `(cartesian_product_transpose_pp([*x[:,None],*a[c:]]) for x in cartesian_product_transpose_pp(a[:c]))` here `a` is the array list and `c` is the number of leading dimensions to chop up. – Paul Panzer Jun 27 '20 at 23:03
  • @PaulPanzer Thanks. It works with `nsample=8`, `c=1` and `a=mu_alist`. However, when `nsample >= 9`, a `MemoryError` occurs. E.g. `arr = np.empty((la, *map(len, arrays)), dtype=dtype)` `MemoryError: Unable to allocate 92.4 GiB for an array with shape (9, 1, 18, 18, 18, 18, 18, 18, 18, 18) and data type uint8`. How do I further chop up the results from a leading dimension that has already been chopped up? Your use of a generator expression is really cool. – Sun Bear Jun 30 '20 at 10:29
  • Hm, I do not understand where the `9` in that shape would come from. Anyway, what happens if you set `c=2` or more? – Paul Panzer Jun 30 '20 at 11:20
  • @PaulPanzer Only `c=1` and `c=nsample` work. All other values fail. E.g. the message is `arr[i, ...] = a[idx[:i]] ValueError: could not broadcast input array from shape (18,1,1,1,1,1,1) into shape (1,1,1,18,18,18,18,18,18)`. – Sun Bear Jun 30 '20 at 12:42
  • Do you use the original function or your modified version? – Paul Panzer Jun 30 '20 at 12:51
  • @PaulPanzer I used my modified version. Using `la-i` also gives the same error msg. I have included the new test script that I had used in my question. – Sun Bear Jun 30 '20 at 13:44
  • Strange, for me it just works. But only with the original version; the modified one also gives me the shape mismatch error you describe. Could you please try again, and make absolutely sure you use the original? Sometimes one forgets to restart the shell and then one's modifications are not picked up because of import caching, etc. – Paul Panzer Jun 30 '20 at 15:07
  • @PaulPanzer I got the same error from both NumPy ver 1.18.5 and 1.19.0. I confirm that using either `la-i` or `i` gave the same MemoryError. Yes, I did restart the idle-python terminal and had also tested the script on the bash terminal. – Sun Bear Jun 30 '20 at 16:57
  • 1
    Memory error is a different matter; to address this could you try and go for larger `c`?. For example, if you know that you can do `nsample=10` and now you want to chop up `nsample=14`. then `c` should at least be `4`. – Paul Panzer Jun 30 '20 at 17:02

1 Answer

I also had to solve this problem, but my solution requires numba to be fast. Here's my answer in a similar thread (originally limited to a regular cartesian product).
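For readers who want to avoid the numba dependency, the chopping idea from the comments can also be written as a small generator. This is only a sketch built on Paul Panzer's original function (note the `la - 1` / `la - i` indexing, not the modified version from the question); the name `chopped_product` is mine, not from the linked answer:

```python
import itertools
import numpy as np

def cartesian_product_transpose_pp(arrays):
    # Paul Panzer's original function.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty((la, *map(len, arrays)), dtype=dtype)
    idx = slice(None), *itertools.repeat(None, la - 1)
    for i, a in enumerate(arrays):
        arr[i, ...] = a[idx[:la - i]]
    return arr.reshape(la, -1).T

def chopped_product(arrays, c):
    # Yield the full product in blocks: one block per row of the product
    # of the first c arrays. Peak memory per block shrinks roughly by a
    # factor of len(arrays[0]) ** c relative to the monolithic version.
    for x in cartesian_product_transpose_pp(arrays[:c]):
        yield cartesian_product_transpose_pp([*x[:, None], *arrays[c:]])
```

Concatenating the yielded blocks reproduces the full product in the same row order as the monolithic call, so the blocks can be consumed (or spliced together) one at a time.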

Hadrien Titeux