
Let's say I have the following arrays (in reality it is a KxNxM masked array with 1,514,764,800 fields, stored as np.ma.array(data, mask=mask, dtype=np.float32)):

import numpy as np
data = np.random.random((3,4,4))
mask = np.zeros((3,4,4), dtype=bool)
mask[1,2,2] = 1
mask[2,2,2] = 1
mask[2,1,3] = 1
mask[:,2,0] = 1

Using the mask I can easily reduce the big dataset to the valid ones:

newdata = data[mask]
newdata
array([ 0.91336042,  0.78399595,  0.9466537 ,  0.75347407,  0.8213428 ,
        0.13172648])

In order to know at which row/column and 3rd dimension they were located I can use:

pos = np.where(mask)
pos
(array([0, 1, 1, 2, 2, 2], dtype=int64),
array([2, 2, 2, 1, 2, 2], dtype=int64),
array([0, 0, 2, 3, 0, 2], dtype=int64))
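
The pair "newdata" and "pos" is lossless: the full masked array can be rebuilt from it. A small sketch on the example above, using np.ma.masked_all:

```python
import numpy as np

data = np.random.random((3, 4, 4))
mask = np.zeros((3, 4, 4), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]   # compact 1-D copy of the selected values
pos = np.where(mask)   # their (k, n, m) coordinates

# Start fully masked; assigning through the index tuple both writes
# the values back and unmasks exactly those positions.
restored = np.ma.masked_all((3, 4, 4), dtype=np.float64)
restored[pos] = newdata
```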

This information ("newdata" and "pos") can be saved, which saves a lot of memory and storage space compared to the full array. However, how can I then calculate e.g. the mean of all fields at data[:,2,2] (in the original data)? In my case, newdata has ~5,300,000 entries.
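
For this concrete case the original array is not needed at all: since newdata[i] came from position (pos[0][i], pos[1][i], pos[2][i]), a boolean condition on the pos arrays selects the matching compact values. A sketch on the small example above:

```python
import numpy as np

data = np.random.random((3, 4, 4))
mask = np.zeros((3, 4, 4), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]
pos = np.where(mask)

# Pick all stored values whose original position was (:, 2, 2).
sel = (pos[1] == 2) & (pos[2] == 2)
mean_22 = newdata[sel].mean()
```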

HyperCube
  • Although sometimes tricky to use, have you tried [Masked arrays](http://docs.scipy.org/doc/numpy/reference/maskedarray.html)? (I assume what you mean is that you want to compute the mean excluding the masked values). – Iguananaut Dec 16 '13 at 16:24
  • The goal is to reduce the big array to the valid ones, conserving their positions. My data are stored in masked arrays. However, I would like to reduce the big array (np.float32 with ~1514764800 elements) to a workable data set considering only the valid values. – HyperCube Dec 16 '13 at 16:30
  • That's basically the point of masked arrays. You're using a normal `ndarray` and applying a boolean mask to it directly, but the `numpy.ma` module in more recent versions of Numpy has a special `masked_array` type for this purpose. – Iguananaut Dec 16 '13 at 16:33
  • I know ;-) they are already stored like: np.ma.array(data, mask=mask, dtype=np.float32), but the size of the masked array is just as big as the data array above. I'd like to reduce this, i.e. don't save the invalid ones at all. – HyperCube Dec 16 '13 at 16:38
  • Okay, I see what you're saying now. That was unclear since you didn't specify that in your example. – Iguananaut Dec 16 '13 at 16:41
  • I've used this same trick of storing mask values with `np.where` instead of the entire masked array before, but I don't have a great solution off the top of my head for performing arbitrary slices with it. I could definitely code up a way to do that, but I have to wonder if there's anything built into Numpy... – Iguananaut Dec 16 '13 at 16:54
  • You could try converting it to a sparse matrix? – M4rtini Dec 16 '13 at 16:56
  • Or you can try storing the dataset in HDF5 with PyTables and do out-of-memory computations – M4rtini Dec 16 '13 at 16:57

2 Answers


I suggest you use a sparse array rather than a masked array if the ratio of unmasked values is smaller than, say, 10% (see scipy.sparse).

Regarding 3D, you can work around the 2-D limitation of sparse matrices by folding two of the dimensions into one, if you don't need them for fast calculations.

Yariv
  • Never heard of sparse matrices before! I will look into it. However, they only seem to support 2-D matrices, and in my case I have multiple values for the same row/column. – HyperCube Dec 16 '13 at 17:17
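
The 2-D limitation is what the answer's folding trick is for: collapse the trailing (N, M) axes into one flat column index. A sketch using scipy.sparse.coo_matrix, assuming SciPy is available; note that a sparse .mean() would average over the implicit zeros, so the mean of only the stored entries is taken from the .data attribute instead:

```python
import numpy as np
from scipy import sparse

K, N, M = 3, 4, 4
data = np.random.random((K, N, M))
mask = np.zeros((K, N, M), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

# Fold (n, m) into a single flat column index n*M + m.
k, n, m = np.where(mask)
sp = sparse.coo_matrix((data[mask], (k, n * M + m)), shape=(K, N * M))

# Column n*M + m of the CSC form holds every stored value at (:, n, m).
col = sp.tocsc()[:, 2 * M + 2]
mean_22 = col.data.mean()   # mean over stored (unmasked) entries only
```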

One thing that would work for the specific case you mentioned would look like this:

In [33]: newmask = pos[0][np.logical_and(pos[1] == 2, pos[2] == 2)]

In [34]: data[:,2,2][newmask]
Out[34]: array([ 0.83677029,  0.34970232])

Something like this could be generalized to work for arbitrary slices, but I don't have time at the moment to provide a full solution. I have to wonder if this is built into Numpy somewhere.
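
For means specifically, the per-slice boolean filtering can be replaced by one aggregation over all (n, m) positions at once, again using only newdata and pos. A sketch with np.bincount; the flat label n*M + m is one arbitrary encoding choice:

```python
import numpy as np

N, M = 4, 4
data = np.random.random((3, N, M))
mask = np.zeros((3, N, M), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]
pos = np.where(mask)

# One flat label per (n, m) pair; sum and count per label in one pass.
labels = pos[1] * M + pos[2]
sums = np.bincount(labels, weights=newdata, minlength=N * M)
counts = np.bincount(labels, minlength=N * M)

with np.errstate(invalid='ignore'):
    means = (sums / counts).reshape(N, M)   # NaN where nothing was stored
```

means[2, 2] is then the mean of all stored values at [:, 2, 2].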

Iguananaut
  • Considering that data (the old big array) should not be kept around afterwards, this doesn't really help, does it? – HyperCube Dec 16 '13 at 18:10
  • I see: once you've applied the mask you want to discard the masked-out data entirely. In that case I agree with @Yariv's answer to use sparse matrices. By the way, if you want to look at a project that makes extensive use of sparse matrices, the [QuTiP](http://qutip.org/) source code is full of good examples (albeit not for 3D arrays). – Iguananaut Dec 16 '13 at 18:49
  • Thanks for your effort! – HyperCube Dec 16 '13 at 18:53