
I have a NumPy matrix, and I am looping through every row with a for loop. I would like to find the first non-zero value in each row.

I found a way to find the first non-zero value here already, but it requires a list as its argument:

for row in matrix:
    val = next((i for i, x in enumerate(row) if x), None)

This always returned 0 for val.

I've also tried converting the row to a list before calculating val:

rowList = row.tolist()

But this also returned the same value.

When I print either value, the output contains two sets of brackets around the list; maybe this has an effect?

i.e.

[[0, 0, 1, 2, 3]]

This occurs even after I've converted the row to a list.

Is there any way I can convert each row to a list so I can then find the index of the first non-zero value, or is there another, simpler way to do this?

Jonas
  • You aren't by any chance using the `np.matrix` class (as opposed to `np.array`)? If so, try appending `.A` to `matrix`. This will convert it to `np.array`, which tends to behave more in line with what one would expect. – Paul Panzer Feb 13 '17 at 16:47

3 Answers


Your next expression works:

In [793]: [next((i for i,x in enumerate(row) if x),None) for row in np.eye(10)]
Out[793]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

OK, that gives the index of the first nonzero, but in my sample case that index is more interesting than the 1 value.

In [801]: [row.nonzero()[0][0] for row in np.eye(10)]
Out[801]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

But if the array has a row of all 0s, as in

arr = np.diag(np.arange(0, 20, 2))

the nonzero version raises an error. It needs to be sensitive to the case where nonzero comes back empty.
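
One way to guard against that is to fall back to None for all-zero rows; a minimal sketch (the first_nonzero_idx name is mine, not from the original answer):

def first_nonzero_idx(arr):
    # index of the first nonzero entry per row, or None for an all-zero row
    out = []
    for row in arr:
        nz = row.nonzero()[0]
        out.append(nz[0] if nz.size else None)
    return out

first_nonzero_idx(np.diag(np.arange(0, 20, 2)))
# -> [None, 1, 2, 3, 4, 5, 6, 7, 8, 9]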

To get the corresponding values from the idx list, use

arr[np.arange(len(idx)), idx]
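
For instance, with an array where every row does have a nonzero (so idx contains plain integers), a small sketch:

arr = np.diag(np.arange(1, 21, 2))            # every row has a nonzero on the diagonal
idx = [row.nonzero()[0][0] for row in arr]
arr[np.arange(len(idx)), idx]
# -> array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19])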

timings

For a large diagonal array, the nonzero version is substantially faster:

In [822]: arr =np.diag(np.arange(1,2000,2))
In [823]: timeit idx = [next((i for i,x in enumerate(row) if x),None) for row in arr]
10 loops, best of 3: 87.6 ms per loop
In [824]: timeit [row.nonzero()[0][0] for row in arr]
100 loops, best of 3: 6.44 ms per loop

For a same-size array with all the 1s early in the row, the next approach is somewhat faster.

In [825]: arr = np.zeros_like(arr,int)
In [826]: arr[:,10]=1
In [827]: timeit idx = [next((i for i,x in enumerate(row) if x),None) for row in arr]
100 loops, best of 3: 3.61 ms per loop
In [828]: timeit [row.nonzero()[0][0] for row in arr]
100 loops, best of 3: 6.41 ms per loop

There's a trade-off between short-circuiting the loop in Python and doing a full loop in compiled C code.


argmax is another way of finding the first nonzero index in each row:

idx = np.argmax(arr>0, axis=1)

With an axis parameter, argmax has to iterate by row, and then within the row, but it does so in compiled code. With a boolean argument like this, argmax does short-circuit. I've explored this in another question about argmax (or min) and nan values, which also short-circuit:

https://stackoverflow.com/a/41324751/901925
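
One caveat worth spelling out (my note, not from the linked answer): argmax over arr>0 returns 0 both when the first nonzero sits in column 0 and when the row is all zeros, so if all-zero rows are possible you may want to flag them separately; a sketch:

mask = arr > 0
idx = mask.argmax(axis=1)      # 0 for all-zero rows as well
has_nz = mask.any(axis=1)      # False marks the all-zero rows
# e.g. keep only the rows that really have a nonzero:
# idx[has_nz]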


Another possibility (channeling @Divakar?):

def foo(arr):
    # row and column indices of all positive entries, in row-major order
    I, J = np.where(arr > 0)
    # the first occurrence of each row index is the first positive column in that row
    u, i = np.unique(I, return_index=True)
    return J[i]
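
A quick check of that idea (my sketch; note that rows with no positive entries simply don't appear in the result, since np.where never reports them):

arr = np.array([[0, 0, 9, 2],
                [0, 3, 0, 1],
                [0, 0, 0, 0]])
foo(arr)
# -> array([2, 1]); the all-zero third row is skipped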
hpaulj
  • I've seen people comment on their upvotes, why am I not allowed to? Anyway, this one is for the `argmax`-short-circuiting-booleans nugget! – Paul Panzer Feb 13 '17 at 18:31

You don't need to "convert a numpy array to a list"; you need a better way of finding non-zero elements. For that you should use nonzero:

Return the indices of the elements that are non-zero.

For example:

import numpy as np

arr = np.array([0, 0, 9, 2])
print(arr[arr.nonzero()][0])
# 9

Or:

import numpy as np

matrix = np.array([[0, 0, 9, 2], [0, 3, 0, 1]])

for row in matrix:
    print(row[row.nonzero()][0])
# 9
# 3
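
If it's the index of the first non-zero value you're after (as in the question) rather than the value itself, a small variation of the same loop works; a sketch that also guards against all-zero rows:

for row in matrix:
    nz = row.nonzero()[0]
    print(nz[0] if nz.size else None)
# 2
# 1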
DeepSpace
  • For large not too sparse rows this won't be the fastest. The whole point of the OP's generator construction is to not search the entire row but stop as soon as you've found the first nonzero element. Assuming a fixed fraction of nonzero elements, your solution is O(n) in the row length n, the OP's is O(1) avg case. – Paul Panzer Feb 13 '17 at 16:57
  • Yes, but a Python loop is orders of magnitude slower than compiled `nonzero`. If the 1st nonzero is near the start of the row, the short-circuiting loop may be faster, but if it's a long way down it will be slower. O(n) v O(1) is nearly meaningless when comparing iterative v compiled operations. – hpaulj Feb 13 '17 at 17:27
  • @hpaulj Yes, it depends on the actual numbers, but having a large non-sparse matrix is not so rare a scenario. Dismissing the generator trick out of hand doesn't look like good advice to me. – Paul Panzer Feb 13 '17 at 17:46
  • I tested the trade off a bit. I also realized that `argmax` can work, and in some cases short-circuits. – hpaulj Feb 13 '17 at 18:19

My guess is that, like many others before you (including myself), you have been tripped up by the np.matrix class.

Slicing instances of this class gives unexpected results:

>>> id = np.identity(4)
>>> type(id)
<class 'numpy.ndarray'>
>>> id[2]
array([ 0.,  0.,  1.,  0.])    #  shape == (4,)
>>> id_m = np.matrix(id)
>>> type(id_m)
<class 'numpy.matrixlib.defmatrix.matrix'>
>>> id_m[2]
matrix([[ 0.,  0.,  1.,  0.]]) #  shape == (1, 4)

As you suspected, this is probably also the reason why your generator trick doesn't work. Iterating over a row of an np.matrix will, because it's nested, return the entire row in one go and then stop, so the generator only ever sees a single nested item at index 0.

If for some reason you are handling a matrix but would prefer it to behave like an array, you can use the .A attribute:

>>> id_m.A
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
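
To tie this back to the generator expression from the question, a small sketch using the id_m example above: enumerating a matrix row yields a single nested item at index 0, whereas the .A version yields the individual elements.

list(enumerate(id_m[2]))     # one item: (0, matrix([[ 0.,  0.,  1.,  0.]]))
list(enumerate(id_m.A[2]))   # four items: indices 0..3 paired with 0., 0., 1., 0.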

One last remark:

Do not convert your rows to a list here! The point of the generator trick you are using is to stop searching as soon as possible. Imagine your rows have 100,000 elements each and every other one is nonzero. The generator will look at the first few, and as soon as it has found the first nonzero (almost certainly within the first 50, say) it will skip the rest of the row (> 99,950 elements). If you convert to a list, you throw this saving away, because building the equivalent list requires reading every single element. That is also the reason why, in this case, a generator can compete with vectorised numpy functions.
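
A small way to see the early exit in action (my sketch, not part of the original answer): share an iterator with the generator and check where it stopped.

import numpy as np

row = np.zeros(100000)
row[37] = 5                        # first nonzero fairly early in the row

it = iter(enumerate(row))
first = next((i for i, x in it if x), None)
print(first)                       # 37
print(next(it)[0])                 # 38 -- the shared iterator stopped right after the hit;
                                   # the rest of the row was never examined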

Paul Panzer