Let's say I have the following arrays (in reality it is a KxNxM masked array with 1514764800 fields, stored as np.ma.array(data, mask=mask, dtype=np.float32)):
import numpy as np
data = np.random.random((3,4,4))
mask = np.zeros((3,4,4), dtype=bool)
mask[1,2,2] = True
mask[2,2,2] = True
mask[2,1,3] = True
mask[:,2,0] = True
Using the mask I can easily reduce the big dataset to just the valid entries:
newdata = data[mask]
newdata
array([ 0.91336042, 0.78399595, 0.9466537 , 0.75347407, 0.8213428 ,
0.13172648])
To know at which position (first dimension, row, column) in the original array each value was located, I can use:
pos = np.where(mask)
pos
(array([0, 1, 1, 2, 2, 2], dtype=int64),
array([2, 2, 2, 1, 2, 2], dtype=int64),
array([0, 0, 2, 3, 0, 2], dtype=int64))
This information ("newdata" and "pos") can be saved and I save a lot of memory and storage space. However, how can I calculate e.g. the mean of all fields at data[:,2,2] (in the original data)? In my case, newdata has ~5300000 entries.