
How do I get the masked data only without flattening the data into a 1D array? That is, suppose I have a numpy array

a = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])

and I mask all elements greater than 1,

b = ma.masked_greater(a, 1)

masked_array(data =
 [[0 1 -- --]
 [0 1 -- --]
 [0 1 -- --]],
             mask =
 [[False False  True  True]
 [False False  True  True]
 [False False  True  True]],
       fill_value = 999999)

How do I get only the masked elements without flattening the output? That is, I need to get

array([[ 2, 3],
       [2, 3],
       [2, 3]])
Aditya369
  • This doesn't seem possible in general -- What if every row doesn't have the same number of masked elements? If there's a constraint such that every row _does_ have the same number of elements, then you can flatten and reshape . . . – mgilson Dec 26 '15 at 04:23
  • @mgilson Every row wouldn't have the same number of masked elements in my case. As a different version of the same question, how do I delete nan values in a 2D array. I only want to delete the values and shorten the corresponding rows. There will be a different number of nan elements in each row. – Aditya369 Dec 26 '15 at 04:38
  • Numpy doesn't _really_ work that way. 2D arrays must have the same number of elements in each row. What you'd be left with is not a 2D array -- it'd be a 1D array of `object` (e.g. lists or 1D arrays). In any event, it probably won't behave like a 2D array any more (and there's no simple 1-liner to achieve this AFAIK, you'll probably need a loop that masks each row individually)... – mgilson Dec 26 '15 at 04:46
  • @mgilson are you interested in working on a kaggle competition with me? – Aditya369 Dec 26 '15 at 05:04
  • If you expect, in general, a ragged array (or list of lists), your example should illustrate that. Otherwise an expression like `a[:,np.all(b.mask,axis=0)]` will give the correct array. – hpaulj Dec 26 '15 at 08:23
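
For the rectangular mask in the question, the expression from that last comment does give the desired 2d result (a quick check, assuming the same a and b as above):

import numpy as np
import numpy.ma as ma

a = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])
b = ma.masked_greater(a, 1)

# columns that are masked in every row; only meaningful when the mask is rectangular
cols = np.all(b.mask, axis=0)
print(a[:, cols])
# [[2 3]
#  [2 3]
#  [2 3]]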

3 Answers


Let's try an example that produces a ragged result: a different number of 'masked' values in each row.

In [292]: a=np.arange(12).reshape(3,4)
In [293]: a
Out[293]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [294]: a<6
Out[294]: 
array([[ True,  True,  True,  True],
       [ True,  True, False, False],
       [False, False, False, False]], dtype=bool)

Boolean indexing gives the flattened list of values that match this condition. It can't return a regular 2d array, so it has to fall back to a flattened 1d array.

In [295]: a[a<6]
Out[295]: array([0, 1, 2, 3, 4, 5])

Do the same thing, but iterating row by row:

In [296]: [a1[a1<6] for a1 in a]
Out[296]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)]

Trying to make an array of the result produces an object type array, which is little more than a list in an array wrapper:

In [297]: np.array([a1[a1<6] for a1 in a])
Out[297]: array([array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)], dtype=object)

The fact that the result is ragged is a good indicator that it is difficult, if not impossible, to perform that action with one vectorized operation.


Here's another way of producing the list of arrays. With sum I find how many matching elements there are in each row, and then use the cumulative counts to split the flattened array into subarrays.

In [320]: idx=(a<6).sum(1).cumsum()[:-1]
In [321]: idx
Out[321]: array([4, 6], dtype=int32)
In [322]: np.split(a[a<6], idx)
Out[322]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=float64)]

It does use 'flattening', and for these small examples it is slower than the row iteration. (Don't worry about the empty float array; split had to construct something and used a default dtype.)
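
A rough way to check that speed claim (a sketch using timeit; the exact numbers depend on array size and machine):

import numpy as np
from timeit import timeit

a = np.arange(12).reshape(3, 4)

def by_rows(a):
    # boolean-index each row separately
    return [a1[a1 < 6] for a1 in a]

def by_split(a):
    # flatten, then split at the cumulative per-row counts
    idx = (a < 6).sum(1).cumsum()[:-1]
    return np.split(a[a < 6], idx)

print(timeit(lambda: by_rows(a), number=10000))
print(timeit(lambda: by_split(a), number=10000))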


A different mask, without empty rows, clearly shows the equivalence of the two approaches.

In [344]: mask=np.tri(3,4,dtype=bool)  # lower tri
In [345]: mask
Out[345]: 
array([[ True, False, False, False],
       [ True,  True, False, False],
       [ True,  True,  True, False]], dtype=bool)
In [346]: idx=mask.sum(1).cumsum()[:-1]
In [347]: idx
Out[347]: array([1, 3], dtype=int32)
In [348]: [a1[m] for a1,m in zip(a,mask)]
Out[348]: [array([0]), array([4, 5]), array([ 8,  9, 10])]
In [349]: np.split(a[mask],idx)
Out[349]: [array([0]), array([4, 5]), array([ 8,  9, 10])]
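
Applied back to the question's arrays, where the wanted values are the masked ones, the same split idea would look like this (a sketch):

import numpy as np
import numpy.ma as ma

a = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])
b = ma.masked_greater(a, 1)

# split the masked values at the cumulative per-row mask counts
idx = b.mask.sum(1).cumsum()[:-1]
print(np.split(a[b.mask], idx))
# [array([2, 3]), array([2, 3]), array([2, 3])]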
hpaulj

Zip the data rows and mask rows together, and then filter each row by its mask:

data = [[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]

mask = [[False, False,  True,  True],
 [False, False,  True,  True],
 [False, False,  True,  True]]

zipped = zip(data, mask) # [([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True])]

masked = []
for lst, row_mask in zipped:
    pairs = zip(lst, row_mask)  # [(0, False), (1, False), (1, True), (1, True)]
    masked.append([num for num, b in pairs if b])

print(masked)  # [[1, 1], [1, 1], [1, 1]]

or more succinctly:

zipped = [...]
masked = [[num for num, b in zip(lst, row_mask) if b] for lst, row_mask in zipped]
print(masked)  # [[1, 1], [1, 1], [1, 1]]
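
If data and mask are numpy arrays rather than plain lists, the same per-row filtering can be written with boolean indexing (a minimal sketch with the same toy values):

import numpy as np

data = np.array([[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]])
mask = np.array([[False, False, True, True],
                 [False, False, True, True],
                 [False, False, True, True]])

# index each row with its own mask row
masked = [row[m] for row, m in zip(data, mask)]
print(masked)  # [array([1, 1]), array([1, 1]), array([1, 1])]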
Martin Konecny
  • Your answer is brilliant. Unfortunately, I have to do this for 100,000 elements. So if I use a for loop, the program takes too long. Sorry for not posting the question properly. I'll wait for some time to see if anyone gives me a vectorized answer, and if not, accept this answer. I'll post the actual problem that I have in a separate question now. – Aditya369 Dec 26 '15 at 04:47
  • Ok, link to your new question when it's ready. The way I understand it, since you are dealing with a list (not an integer for easy bit masking) you need to visit each element regardless, since you need to read the corresponding value of its mask - making an iteration necessary. But I'll read your new question to see if I'm understanding correctly. – Martin Konecny Dec 26 '15 at 04:51
  • After reading what mgilson said, I think this is the best solution I can get for now. I'll link to the new question as soon as I can post it. Also, if either @Martin Konecny or mgilson is interested in a kaggle competition and would like to team up with me, do tell me. It's really interesting. – Aditya369 Dec 26 '15 at 05:03
  • Here's the link to the new question. http://stackoverflow.com/questions/34525118/find-cumsum-of-subarrays-split-by-indices-for-numpy-array-efficiently – Aditya369 Dec 30 '15 at 07:43

Because numpy operations are vectorized, you can use np.where to select items from the original array and substitute None (or some other placeholder) wherever a value has been masked out. Note that using None forces a less compact, object-dtype representation of the array, so you may want to use -1 or some other special value instead.

import numpy as np

a = np.array([
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3]])

mask = np.array([[ True,  True,  True,  True],
    [ True, False,  True,  True],
    [False,  True,  True, False]])

np.where(mask, a, None)

This produces

array([[0, 1, 2, 3],
       [0, None, 2, 3],
       [None, 1, 2, None]], dtype=object)
Greg Nisbet
  • This doesn't remove the elements though. It just gives them a different value. Also, this can be done directly by using the ma.fill_value function. – Aditya369 Dec 26 '15 at 05:05
  • @Aditya369 you're absolutely right. This is equivalent to `ma.set_fill_value(None)` since you're already using a masked array. Are you looking to replace each row with a jagged row with the masked items removed? – Greg Nisbet Dec 26 '15 at 05:21
  • Yes. That's exactly what I want. But without using a for loop. – Aditya369 Dec 26 '15 at 05:24