Mask numpy array based on index

Question

How do I mask an array based on the actual index values?

That is, if I have a 10 x 10 x 30 matrix and I want to mask the array when the first and second index equal each other.

For example, [1, 1 , :] should be masked because 1 and 1 equal each other but [1, 2, :] should not because they do not.

I'm only asking this with the third dimension because it resembles my current problem and may complicate things. But my main question is, how to mask arrays based on the value of the indices?

Off topic, your avatar matches your question topic nicely :P — askewchan, Sep 17 '13 at 22:13
Yep, just trying to draw Z's all over my code. Thanks for helping me with that. — aleph4, Sep 17 '13 at 23:56

askewchan · Accepted Answer · 2013-09-25T17:00:19.190

7

In general, to access the value of the indices, you can use np.meshgrid:

i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i == j)

The advantage of this method is that it works for arbitrary boolean functions on i, j, and k. It is a bit slower than the use of the identity special case.

In [56]: %%timeit
   ....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
   ....: i == j
10000 loops, best of 3: 96.8 µs per loop

As @Jaime points out, meshgrid supports a sparse option, which doesn't do so much duplication, but requires a bit more care in some cases because they don't broadcast. It will save memory and speed things up a little. For example,

In [77]: x = np.arange(5)

In [78]: np.meshgrid(x, x)
Out[78]: 
[array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]]),
 array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])]

In [79]: np.meshgrid(x, x, sparse=True)
Out[79]: 
[array([[0, 1, 2, 3, 4]]),
 array([[0],
       [1],
       [2],
       [3],
       [4]])]

So, you can use the sparse version as he says, but you must force the broadcasting as such:

i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
m.mask = np.repeat(i==j, k.size, axis=2)

And the speedup:

In [84]: %%timeit
   ....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
   ....: np.repeat(i==j, k.size, axis=2)
10000 loops, best of 3: 73.9 µs per loop

edited Sep 25 '13 at 17:00

answered Sep 17 '13 at 22:28

askewchan

45,161
17
118
134

1

Why not use `np.arange` instead of `range`, should save time. – Daniel Sep 17 '13 at 22:38
3

Nice use of `map`... But you are creating 3 full masks of `(10, 10, 30)`... If you set `sparse=True` in the call to `np.meshgrid`, you reduce the memory footprint substantially, but you need to include all dimensions in the creation of the mask, e.g. `(i == j) & (k >= 0)` runs twice as fast as your solution. – Jaime Sep 17 '13 at 22:51
Thanks @Jaime, I was trying to figure out how to do it with the sparser versions but couldn't! – askewchan Sep 17 '13 at 22:52
1

It's a pity that boolean mask indexing does not support broadcasting. There may be some good reason for it, but it would make some operations ridiculously easy. – Jaime Sep 17 '13 at 22:57
1

`sparse=True` seems like a lot of trouble for 8us, at least in this case :). – Bi Rico Sep 18 '13 at 07:06
@BiRico Yep, just there for completeness. Honestly, if time is an issue, the `np.identity` answer is best anyway. – askewchan Sep 18 '13 at 12:28

askewchan · Answer 2 · 2013-09-17T22:41:09.833

0

In your special case of wanting to mask the diagonals, you can use the np.identity() function which returns ones along the diagonal. Since you have the third dimension, we have to add that third dimension to the the identity matrix:

m.mask = np.identity(10)[...,None]*np.ones((1,1,30))

There might be a better way of constructing that array, but it is basically stacking 30 of the np.identity(10) array. For example, this is equivalent:

np.dstack((np.identity(10),)*30)

but slower:

In [30]: timeit np.identity(10)[...,None]*np.ones((1,1,30))
10000 loops, best of 3: 40.7 µs per loop

In [31]: timeit np.dstack((np.identity(10),)*30)
1000 loops, best of 3: 219 µs per loop

And @Ophion's suggestions

In [33]: timeit np.tile(np.identity(10)[...,None], 30)
10000 loops, best of 3: 63.2 µs per loop

In [71]: timeit np.repeat(np.identity(10)[...,None], 30)
10000 loops, best of 3: 45.3 µs per loop

edited Sep 17 '13 at 22:41

answered Sep 17 '13 at 22:12

askewchan

45,161
17
118
134

`np.tile(np.identity(10)[...,None], 30)` – Daniel Sep 17 '13 at 22:21
1

Well thats interesting. `tile` is basically a wrapper on `repeat`. So if you run `np.repeat(np.identity(10)[...,None], 30, axis=-1)` you skip some extra `if` statements, but `np.identity(10)[...,None]*np.ones((1,1,30))` is universally faster for any array size in 1 dimension. Good to know. – Daniel Sep 17 '13 at 22:35
@Ophion Cool, I'd never considered that along dimensions of length 1, `tile` and `repeat` are equivalent. – askewchan Sep 17 '13 at 22:43
1

You do have to specify an axis as `np.repeat` is odd, instead of the default `axis=-1` as with most function it is `axis=None`. If you do not explicitly pass the axis keyword you get a 1D array out and will have to reshape it manually. – Daniel Sep 17 '13 at 22:49

Mask numpy array based on index

2 Answers2

Linked