
I'm using numpy.argmax to calculate the first index where True can be found in a vector of bools. Invoking it on a pandas.Series gives me the Series index rather than the element index.

I found a subtle bug in my code that popped up when the vector was all False; returning index 0 in this case seems dangerous, since 0 could just as well mean that True was in the first element. What's the design choice for this return value?

>>> import numpy, pandas
>>> numpy.argmax([False,False,False])
0
>>> numpy.argmax([True, False, True])
0

>>> s = pandas.Series( [ False, False, False ] , index=[3,6,9] )
>>> numpy.argmax(s)
3
>>> s1 = pandas.Series( [ True, False, False ] , index=[3,6,9] )
>>> numpy.argmax(s1)
3
jxramos
  • What else should it return? `False` is the maximum value. Think about `np.argmax([0,0,0])`. – hpaulj Aug 18 '17 at 21:55
  • You're absolutely right, my thinking got somehow confused with a find operation in like C++ or something where failure to find would return `-1`, but this isn't a find operation of course. Got to love those hard lessons that really make a point stick. – jxramos Aug 18 '17 at 22:10
  • Python strings and/or lists have `find` or `index` methods that return -1 or raise an error when the item isn't found. `numpy` arrays don't have anything quite the same. `nonzero` (`where`) returns all finds, which may be empty. – hpaulj Aug 19 '17 at 01:20
  • That's the crux of the matter: some of the [solutions](https://stackoverflow.com/a/29509282/1330381) use operations that are not quite equivalent to find. The linked answer explicitly states that the `argmax` approach assumes the thing being sought exists. I wound up having to check in advance before trusting the argmax value. I guess with `find` you likewise have to check that the element was found, but that's a familiar post-op check. Now I know how to handle argmax when applying it to find operations; it's a bit disappointing that the find idiom is not part of pandas. – jxramos Aug 19 '17 at 05:46
  • Another problem is that few of the numpy searches short-circuit. A few special cases do. In one more complex search for the first 0, I got a big improvement with a custom `cython` function. – hpaulj Aug 19 '17 at 06:39
  • Yes, I was thinking about the lack of short-circuit in the pre-op check of seeing if the element is in the Series, that's tantamount to iterating over the Series twice, to check, then to get the index. My answer below shows the post-op check equivalent that prevents the double iteration, but it's a far cry from a short circuit. – jxramos Aug 23 '17 at 18:02
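The `nonzero`-based find mentioned in the comments above could be sketched roughly as follows. This is only an illustration, not something numpy provides: the helper name `first_true_index` and the `-1` sentinel are assumptions chosen to mimic `str.find`.

```python
import numpy as np

def first_true_index(values):
    """Hypothetical helper: position of the first True in `values`, or -1 if there is none."""
    # flatnonzero returns the positions of all truthy elements (possibly an empty array).
    hits = np.flatnonzero(np.asarray(values, dtype=bool))
    return int(hits[0]) if hits.size else -1

print(first_true_index([False, False, False]))  # -1
print(first_true_index([False, True, True]))    # 1
```

Like the other approaches discussed here, this scans the whole array and does not short-circuit at the first hit.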

3 Answers


From the source code:

In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.

In the case where the vector is all False, the max value is False (which compares as 0), so the index of the first occurrence of the max value, i.e. 0, is returned.
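To make the ambiguity concrete, here is a small sketch that shows both inputs yielding 0 and one possible way to tell them apart by checking `any()` first. The names `all_false`, `first_true` and `argmax_or_none` are illustrative assumptions, not part of numpy.

```python
import numpy as np

all_false = np.array([False, False, False])
first_true = np.array([True, False, True])

# Both calls give 0: the maximum is False in one case and True in the other.
print(np.argmax(all_false), np.argmax(first_true))

def argmax_or_none(arr):
    """Hypothetical helper: position of the first True, or None when no element is True."""
    return int(np.argmax(arr)) if arr.any() else None

print(argmax_or_none(all_false))   # None
print(argmax_or_none(first_true))  # 0
```

The `any()` call is a second pass over the data, which is the double iteration discussed in the comments on the question.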

bphi
  • Interesting, I was blindsided by thinking only along the lines of finding the first `True` and completely lost sight of the ordering taking place under the hood. Got to be careful about labeling a function's behavior by an intended purpose rather than by the general behavior it's designed with. My code has to branch into separate logic when everything is False, but that's OK; I just got caught off guard. – jxramos Aug 18 '17 at 22:08

So at the end of the day it was a misinterpretation of argmax (which is a straightforward function): I forgot that False and True are values that have an order. I was blind to this when using argmax as a tool for finding a specific element (the index of any True element) and expected it to behave like a common find function, with the usual conventions of returning an empty list [], -1 for an index, or even None when the element does not exist.

I wound up coding my ultimate solution as follows

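import pandas

def find_first_true(listOfBools):  # function name is illustrative
    s = pandas.Series(listOfBools)
    idx = s.argmax()

    # argmax yields the first index when everything is False, so verify the hit
    if idx == s.index[0] and not s[idx]:
        return -1
    return idx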
jxramos

If you are using pandas, you can mask the boolean series with itself and then take the min or max of the resulting index. This gives nan if there are no True values.

>>> import pandas as pd
>>> s = pd.Series([False, False, True, False, True, False],
...               index=[0, 1, 2, 3, 4, 5])
>>> s[s].index.max()
4
>>> s[s].index.min()
2
>>> s = pd.Series([False, False, False], index=[0,1,2])
>>> s[s].index.max()
nan
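If a `nan` result is awkward to test for, the same masking idea can be wrapped so that the "not found" case is explicit. This is a minimal sketch under assumptions: the helper name `first_true_label` and the choice of `None` as the sentinel are hypothetical, not pandas API.

```python
import pandas as pd

def first_true_label(s):
    """Hypothetical helper: index label of the first True in a boolean Series, else None."""
    labels = s[s].index  # labels of the True entries, in positional order
    return labels[0] if len(labels) else None

s = pd.Series([False, True, True], index=[3, 6, 9])
print(first_true_label(s))  # 6
print(first_true_label(pd.Series([False, False, False], index=[3, 6, 9])))  # None
```

Note that this returns the first label in positional order, which matches `index.min()` only when the index is sorted ascending.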
jondo