
Let's say I have the following arrays (in reality it is a KxNxM masked array with 1,514,764,800 fields, stored as np.ma.array(data, mask=mask, dtype=np.float32)):

import numpy as np
data = np.random.random((3,4,4))
mask = np.zeros((3,4,4), dtype=bool)
mask[1,2,2] = 1
mask[2,2,2] = 1
mask[2,1,3] = 1
mask[:,2,0] = 1

Using the mask I can easily reduce the big dataset to the valid ones:

newdata = data[mask]
newdata
array([ 0.91336042,  0.78399595,  0.9466537 ,  0.75347407,  0.8213428 ,
        0.13172648])

In order to know at which row/column and 3rd dimension they were located I can use:

pos = np.where(mask)
pos
(array([0, 1, 1, 2, 2, 2], dtype=int64),
array([2, 2, 2, 1, 2, 2], dtype=int64),
array([0, 0, 2, 3, 0, 2], dtype=int64))
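
The pair "newdata" and "pos" is lossless: the full masked array can be rebuilt from it. A small sketch on the example above, using np.ma.masked_all:

```python
import numpy as np

data = np.random.random((3, 4, 4))
mask = np.zeros((3, 4, 4), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]   # compact 1-D copy of the selected values
pos = np.where(mask)   # their (k, n, m) coordinates

# Start fully masked; assigning through the index tuple both writes
# the values back and unmasks exactly those positions.
restored = np.ma.masked_all((3, 4, 4), dtype=np.float64)
restored[pos] = newdata
```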

This information ("newdata" and "pos") can be saved, which saves a lot of memory and storage space compared to the full array. However, how can I then calculate e.g. the mean of all fields at data[:,2,2] (in the original data)? In my case, newdata has ~5,300,000 entries.
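
For this concrete case the original array is not needed at all: since newdata[i] came from position (pos[0][i], pos[1][i], pos[2][i]), a boolean condition on the pos arrays selects the matching compact values. A sketch on the small example above:

```python
import numpy as np

data = np.random.random((3, 4, 4))
mask = np.zeros((3, 4, 4), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]
pos = np.where(mask)

# Pick all stored values whose original position was (:, 2, 2).
sel = (pos[1] == 2) & (pos[2] == 2)
mean_22 = newdata[sel].mean()
```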

HyperCube
  • Although sometimes tricky to use, have you tried [Masked arrays](http://docs.scipy.org/doc/numpy/reference/maskedarray.html)? (I assume what you mean is that you want to compute the mean excluding the masked values). – Iguananaut Dec 16 '13 at 16:24
  • The goal is to reduce the big array to the valid ones, conserving their positions. My data are stored in masked arrays. However, I would like to reduce the big array (np.float32 with ~1514764800 elements) to a workable data set considering only the valid values. – HyperCube Dec 16 '13 at 16:30
  • That's basically the point of masked arrays. You're using a normal `ndarray` and applying a boolean mask to it directly, but the `numpy.ma` module in more recent versions of Numpy has a special `masked_array` type for this purpose. – Iguananaut Dec 16 '13 at 16:33
  • I know ;-) they are already stored like: np.ma.array(data, mask=mask, dtype=np.float32), but the size of the masked array is just as big as the data array above. I'd like to reduce this, i.e. don't save the invalid ones at all. – HyperCube Dec 16 '13 at 16:38
  • Okay, I see what you're saying now. That was unclear since you didn't specify that in your example. – Iguananaut Dec 16 '13 at 16:41
  • I've used this same trick of storing mask values with `np.where` instead of the entire masked array before, but I don't have a great solution off the top of my head for performing arbitrary slices with it. I could definitely code up a way to do that, but I have to wonder if there's anything built into Numpy... – Iguananaut Dec 16 '13 at 16:54
  • You could try converting it to a sparse matrix? – M4rtini Dec 16 '13 at 16:56
  • Or you can try storing the dataset in HDF5 with PyTables and do out-of-memory computations – M4rtini Dec 16 '13 at 16:57

2 Answers


I suggest you use a sparse array rather than a masked array if the ratio of unmasked values is smaller than, say, 10% (see scipy.sparse).

Regarding 3D, you can work around the 2-D limitation of sparse matrices by folding two of the dimensions into one, if you don't need them for fast calculations.

Yariv
  • Never heard of sparse matrices before! I will look into it. However, they only seem to support 2-D matrices, and in my case I have multiple values for the same row/column. – HyperCube Dec 16 '13 at 17:17
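
The 2-D limitation is what the answer's folding trick is for: collapse the trailing (N, M) axes into one flat column index. A sketch using scipy.sparse.coo_matrix, assuming SciPy is available; note that a sparse .mean() would average over the implicit zeros, so the mean of only the stored entries is taken from the .data attribute instead:

```python
import numpy as np
from scipy import sparse

K, N, M = 3, 4, 4
data = np.random.random((K, N, M))
mask = np.zeros((K, N, M), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

# Fold (n, m) into a single flat column index n*M + m.
k, n, m = np.where(mask)
sp = sparse.coo_matrix((data[mask], (k, n * M + m)), shape=(K, N * M))

# Column n*M + m of the CSC form holds every stored value at (:, n, m).
col = sp.tocsc()[:, 2 * M + 2]
mean_22 = col.data.mean()   # mean over stored (unmasked) entries only
```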

One thing that would work for the specific case you mentioned would look like this:

In [33]: newmask = pos[0][np.logical_and(pos[1] == 2, pos[2] == 2)]

In [34]: data[:,2,2][newmask]
Out[34]: array([ 0.83677029,  0.34970232])

Something like this could be generalized to work for arbitrary slices, but I don't have time at the moment to provide a full solution. I have to wonder if this is built into Numpy somewhere.
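
For means specifically, the per-slice boolean filtering can be replaced by one aggregation over all (n, m) positions at once, again using only newdata and pos. A sketch with np.bincount; the flat label n*M + m is one arbitrary encoding choice:

```python
import numpy as np

N, M = 4, 4
data = np.random.random((3, N, M))
mask = np.zeros((3, N, M), dtype=bool)
mask[1, 2, 2] = mask[2, 2, 2] = mask[2, 1, 3] = True
mask[:, 2, 0] = True

newdata = data[mask]
pos = np.where(mask)

# One flat label per (n, m) pair; sum and count per label in one pass.
labels = pos[1] * M + pos[2]
sums = np.bincount(labels, weights=newdata, minlength=N * M)
counts = np.bincount(labels, minlength=N * M)

with np.errstate(invalid='ignore'):
    means = (sums / counts).reshape(N, M)   # NaN where nothing was stored
```

means[2, 2] is then the mean of all stored values at [:, 2, 2].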

Iguananaut
  • Considering that data (the old big array) should not be kept around afterwards, this doesn't really help, does it? – HyperCube Dec 16 '13 at 18:10
  • I see: once you've applied the mask you want to discard the masked-out data entirely. In that case I agree with @Yariv's answer to use sparse matrices. By the way, if you want to look at a project that makes extensive use of sparse matrices, the [QuTiP](http://qutip.org/) source code is full of good examples (albeit not for 3D arrays). – Iguananaut Dec 16 '13 at 18:49
  • Thanks for your effort! – HyperCube Dec 16 '13 at 18:53