
I have three NumPy arrays and need to form the cartesian product between them. The dimensions of the arrays are not fixed, so they can take different values; one example could be A=(10000, 50), B=(40, 50), C=(10000, 50).

Then I perform some processing on the product (like a + b - c). Below is the function that I am using to form it.

import numpy as np

def cartesian_2d(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)

    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out

a = [[ 0, -0.02], [1, -0.15]]
b = [[0, 0.03]]

result = cartesian_2d([a,b,a])

# array([[[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]]])

The output is the same as with itertools.product. However, I am using my custom function to take advantage of NumPy's vectorized operations, and in my case it performs noticeably better than itertools.product.

After this, I do

result[:, 0, :] + result[:, 1, :] - result[:, 2, :]

# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])

So this is the final expected result.

The function works as expected as long as my arrays fit in memory. But my use case requires me to work with huge data, and I get a MemoryError at the np.empty() line because it cannot allocate the required memory. I am working with circa 20 GB of data at the moment, and this might increase in the future.

These arrays represent vectors and will have to be stored as floats, so I cannot use int. Also, they are dense arrays, so using sparse is not an option.

I will be using these arrays for further processing and ideally I would not like to store them in files at this stage. So memmap / h5py format may not help, although I am not sure of this.

If there are other ways to form this product, that would be okay too.

I am sure there are applications working with far larger datasets than this, so I hope someone has encountered such issues before and can tell me how to handle them. Please help.

swathis
  • Is using Cython or Numba not an option? – Hameer Abbasi Feb 18 '18 at 19:31
  • @HameerAbbasi - I haven't used either one before. Could you please elaborate? – swathis Feb 19 '18 at 13:45
  • Numba is a library that can speed up the calculations of many Numpy tasks by adding a simple decorator. Cython can do the same but can interface with pure C/C++ at the same time, it needs more customization and manual optimization. – Hameer Abbasi Feb 19 '18 at 14:21
  • I meant, if you had some directions on how to form the product by not creating the arrays. I am currently trying to figure this out, in order to overcome the MemoryError. Unless I find a way to avoid the huge array creation, I am not quite sure if these other libraries will help, since it is not a specific calculation I am trying to optimize. I will check them out though. Thank you. – swathis Feb 19 '18 at 15:38
  • You would essentially write the loop directly in Cython/Numba and do whatever operations you want to do there instead of actually explicitly calculating the Cartesian product (see the sketch after these comments). There is essentially no way to reduce the memory of something that big. – Hameer Abbasi Feb 19 '18 at 15:41
  • @HameerAbbasi Will definitely try one of these libraries soon. Thanks again! – swathis Feb 19 '18 at 16:44
  • "I am working with circa 20GB" Can you give example dimensions for each of the arrays which will be given as input to the product? – myrtlecat Feb 20 '18 at 17:48
  • @myrtlecat - Edited the original post to include this information. – swathis Feb 20 '18 at 20:27
  • Is @Nils Werner answer working for you (A==C)? His code is a lot easier to understand than your code and can be optimized more easily... – max9111 Feb 21 '18 at 10:41
  • There seems to be a copy&paste error in your code block (2nd line of 1st if statement). – Paul Panzer Feb 21 '18 at 12:44
  • @PaulPanzer - Thanks for pointing it out. Should be corrected now. – swathis Feb 23 '18 at 16:35
  • @max9111 - The code by Nils Werner links to another function which does something similar to form the cartesian. Check out the function in the linked page. – swathis Feb 23 '18 at 16:38
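
A minimal sketch of the Cython/Numba idea from these comments, assuming (purely for illustration) that the downstream processing is a row-wise sum over a + b - c; the function name and the chosen reduction are hypothetical and not part of the question:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def rowwise_sum_of_combinations(a, b, c):
    # Loop over every (i, j, k) combination and reduce on the fly,
    # never materialising the full cartesian product in memory.
    out = np.empty(a.shape[0] * b.shape[0] * c.shape[0], dtype=np.float64)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(c.shape[0]):
                m = (i * b.shape[0] + j) * c.shape[0] + k
                s = 0.0
                for col in range(a.shape[1]):
                    s += a[i, col] + b[j, col] - c[k, col]
                out[m] = s
    return out

The row order of out matches the order produced by cartesian_2d([a, b, c]); only the reduction would need to be adapted to the actual processing.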

3 Answers


If at least your result fits in memory

The following produces your expected result without relying on an intermediate array three times the size of the result. It uses broadcasting.

Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:

a, b = np.asarray(a), np.asarray(b)  # broadcasting needs arrays, not lists

# shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
#array([[ 0.  ,  0.03],
#       [-1.  ,  0.16],
#       [ 1.  , -0.1 ],
#       [ 0.  ,  0.03]])

Addressing result rows by 'ID'

You may consider leaving out the reshape. That would allow you to address the rows in the result by a combined index. If your component IDs are just 0, 1, 2, ... like in your example, this would be the same as the combined ID. For example, aba[1, 0, 0] would correspond to the row obtained as second row of a + first row of b - first row of a.
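
As a small illustration of this (assuming a and b have been converted to arrays as above), the unreshaped result can be addressed directly by the three component indices:

aba_3d = a[:, None, None] + b[None, :, None] - a[None, None, :]
aba_3d.shape      # (2, 1, 2, 2): one axis per operand plus the shared row length
aba_3d[1, 0, 0]   # second row of a + first row of b - first row of a -> array([ 1. , -0.1])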

A bit of explanation

Broadcasting: When for example adding two arrays their shapes do not have to be identical, only compatible because of broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:

    [[2],                 [[7],   [[2],
7 +  [3],     equiv to     [7], +  [3],
     [4]]                  [7]]    [4]]

Broadcasting:

              [[4],            [[1, 2, 3],   [[4, 4, 4],
[[1, 2, 3]] +  [5],  equiv to   [1, 2, 3], +  [5, 5, 5],
               [6]]             [1, 2, 3]]    [6, 6, 6]]

For this to work, each dimension of each operand must either be 1 or equal to the corresponding dimension of the other operands. If an operand has fewer dimensions than the others, its shape is padded with ones on the left. Note that the equivalent arrays shown in the illustration are not explicitly created.
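
A quick way to check the rule with NumPy itself (a small illustration of my own, mirroring the second diagram above):

import numpy as np

x = np.array([[1, 2, 3]])        # shape (1, 3)
y = np.array([[4], [5], [6]])    # shape (3, 1)
np.broadcast(x, y).shape         # (3, 3): each size-1 dimension is stretched to match
x + y                            # array([[5, 6, 7], [6, 7, 8], [7, 8, 9]])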

If the result also does not fit

In that case I don't see how you can possibly avoid using storage, so h5py or something like that it is.

Removing the first column from each operand

This is just a matter of slicing:

a_no_id = a[:, 1:]

etc. Note that, unlike Python lists, NumPy arrays when sliced do not return a copy but a view. Therefore efficiency (memory or runtime) is not an issue here.
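
If you want to convince yourself that no copy is made, a quick check (my own addition, not from the original answer):

a_no_id = a[:, 1:]
np.shares_memory(a, a_no_id)   # True: the slice is a view into a's buffer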

Paul Panzer
  • Thanks for the nice answer. While being concise, it also eliminates the need for the intermediate array. Could you maybe elaborate a bit on how exactly the broadcast is working here. I did go through the documentation briefly, but am not totally clear with it. Also, what if even the result does not fit into memory, is there a solution for this? – swathis Feb 23 '18 at 16:40
  • Is there also an efficient way to exclude, say, the first entry (column value) from each of the arrays, since they actually represent IDs, and to concatenate these identifiers from the different arrays? a = [[ 0, -0.02], [1, -0.15]], b = [[0, 0.03]], so result = array([[ '0_0_0' , 0.03], ['0_0_1' , 0.16], [ '1_0_0' , -0.1 ], [ '1_0_1' , 0.03]]). I understand that storing them as strings may not be totally efficient, so am looking for a better way. – swathis Feb 23 '18 at 16:51
  • @swathis I tried to address your additional questions in the answer. Please have a look. – Paul Panzer Feb 23 '18 at 19:07
  • 1
    @ Paul Panzer - Thank you very much for the detailed explanation and for clarifying! I was able to achieve what I wanted based on this. – swathis Feb 26 '18 at 23:40
  • One more question - In case I wanted to compute a row-wise operation(like norm or sum or mean) on the result (i.e.,aba) and just store this? I tried and seems like memory is being allocated for aba too. Is this possible without storing aba itself? I am asking since it might suffice to just have this computed and also help in the memory issue since aba itself will take up lot of space. – swathis Feb 27 '18 at 15:13

An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist; one possible helper is sketched after the code below):

idx = cartesian_product(
    np.arange(len(a)),
    np.arange(len(b)) + len(a),
    np.arange(len(a))
)

And then use fancy indexing to create the output array:

x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
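
The cartesian_product helper used above is linked from another answer and not shown here; a minimal hypothetical stand-in for 1D integer arrays could look like this (one of several known formulations, not necessarily the linked one):

import numpy as np

def cartesian_product(*arrays):
    # Build an index grid over all 1D inputs and stack it into
    # an array of shape (n_combinations, n_arrays).
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

With the example arrays from the question this reproduces the same 4x3x2 result as cartesian_2d([a, b, a]).
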
Nils Werner
  • While it is a good idea to form a cartesian product of the indices, this approach would still end up forming the intermediate array, thus not solving the memory issue because it will not fit. – swathis Feb 23 '18 at 16:37

Writing results efficiently on disk

First, a few thoughts on the size of the resulting data.

Size of the result data

size_in_GB = A.shape[0]**2 * A.shape[1] * B.shape[0] * size_of_datatype / 1e9

In your question you mentioned A.shape=(10000, 50) and B.shape=(40, 50). Using float64, your result will be approximately 1600 GB. This can be done without problems if you have enough disk space, but you have to think about what you want to do with the data next. Maybe this is only an intermediate result, and processing the data in blocks is possible.
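
Plugging in the shapes from the question (8 bytes per float64 value):

A_shape, B_shape = (10000, 50), (40, 50)
size_in_GB = A_shape[0]**2 * A_shape[1] * B_shape[0] * 8 / 1e9
size_in_GB   # 1600.0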

If this is not the case, here is an example of how to handle 1600 GB of data efficiently (RAM usage will be about 200 MB). The throughput should be around 200 MB/s on realistic data.

The code calculating the results is from @PaulPanzer.

import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c

a=np.arange(500*50).reshape(500, 50)
b=np.arange(40*50).reshape(40, 50)

# The Blosc filter options aren't well documented, have a look at https://github.com/Blosc/hdf5-blosc
# (the first four values are reserved; then compression level 0...9, shuffle on/off, compressor id)
compression_opts = (0, 0, 0, 0, 9, 1, 1)  # level 9, shuffle enabled, compressor 1 (I guess that's lz4)

File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2*300)
dset = f.create_dataset('Data',
                        shape=(a.shape[0]**2*b.shape[0], a.shape[1]),
                        dtype='d',
                        chunks=(a.shape[0]*b.shape[0], 1),
                        compression=32001,
                        compression_opts=compression_opts,
                        shuffle=False)

# Write the data, one block of b.shape[0]*a.shape[0] rows per row of a
for i in range(a.shape[0]):
    sh = a.shape[1:]
    aba = (a[i] + b[:, None] - a).reshape(-1, *sh)
    dset[i*a.shape[0]*b.shape[0]:(i+1)*a.shape[0]*b.shape[0]] = aba

f.close()

Reading the data

File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
chunks_size = 500
for i in range(0, dset.shape[0], chunks_size):
    # Iterate over the first dimension in blocks of chunks_size rows
    data = dset[i:i+chunks_size, :]  # avoid excessive calls to the hdf5 library
    # Do something with the data

f.close()

f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
for i in range(dset.shape[1]):
    # Iterate over the second dimension.
    # Fancy indexing, e.g. [:, i], would be much slower;
    # use np.expand_dims, or in this case np.squeeze after reading from the dset,
    # if you want the same result as [:, i] (a 1-dim array).
    data = dset[:, i:i+1]
    # Do something with the data

f.close()

On this test example I get a write throughput of about 550 MB/s, a read throughput of about 500 MB/s along the first dimension and 1000 MB/s along the second, and a compression ratio of 50. NumPy memmap will only provide acceptable speed if you read or write data along the fastest-changing direction (in C, the last dimension); with the chunked data format used by HDF5 here, this isn't a problem at all. Compression is also not possible with NumPy memmap, leading to larger file sizes and slower speed.

Please note that the compression filter and chunk shape have to be tuned to your needs. This depends on how you want to read the data afterwards and on the actual data.

If you get this completely wrong, the performance can be 10-100 times slower than a proper setup (e.g. the chunk shape can be optimized for either the first or the second read example).

max9111
  • 1
    Sorry for the nit pick, but as OP is apparently just learning about broadcasting, I think we should be precise: Since you are slicing away the first dimension of the full `aba` it would be more to the point to use `aba = (a[i, None, None] + b[:, None] - a[None, :]).reshape(-1, *sh)` or even `aba = (a[i] + b[:, None] - a).reshape(-1, *sh)`. Your code also works but only because the first operand implicitly creates a new axis at the left (which is then immediately reshaped away). – Paul Panzer Feb 23 '18 at 20:05
  • Thanks for your hints, I will correct this tomorrow (I have only my phone available now) – max9111 Feb 23 '18 at 20:11