237

I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)

I.e., I'd like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7)) # should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print(find(myseries, 7))

but I assume there should be a better way. Is there?

sashkello

12 Answers

291
>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

I admit there should be a better way to do this, but it at least avoids looping over the object in Python and moves the work to the C level.
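A minimal sketch of wrapping the boolean-mask lookup in a helper that returns None when the value is absent, instead of raising IndexError on `.index[0]`. The function name `first_index_of` is hypothetical and not part of pandas.

```python
import pandas as pd


def first_index_of(series, value):
    """Return the index label of the first element equal to value, or None."""
    # Boolean mask computed in vectorized code, then applied to the index.
    matches = series.index[(series == value).to_numpy()]
    return matches[0] if len(matches) > 0 else None


myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_index_of(myseries, 7))   # 3
print(first_index_of(myseries, 10))  # None
```

Because the mask is applied to `series.index`, the helper returns labels rather than positions, so it should behave the same way for string or datetime indexes.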

Jonathan Eunice
Viktor Kerkez
  • 16
    The trouble here is it assumes the element being searched for is actually in the list. It's a bummer pandas doesn't seem to have a built in find operation. – jxramos Aug 23 '17 at 17:16
  • 12
    This solution only works if your series has a sequential integer index. If your series index is by datetime, this doesn't work. – Andrew Medlin Jul 07 '18 at 11:45
62

Converting to an Index, you can use get_loc

In [1]: import pandas as pd; from pandas import Index, Series; from numpy.random import randint

In [2]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: Index(myseries).get_loc(7)
Out[3]: 3

In [4]: Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

Returns a boolean array if the matches are non-contiguous

In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)

Uses a hash table internally, so it is fast

In [7]: s = Series(randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, creating an index has a one-time overhead (it's incurred when you actually DO something with the index, e.g. checking is_unique)

In [2]: s = Series(randint(0,10,10000))

In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
Jeff
27

I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.

Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425, 300,
   ...:         212, 150, 106, 75, 53, 38]

In [4]: myseries = pd.Series(data, index=range(1,26))

In [5]: assert(myseries[21] == 150)

In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@Jeff's answer seems to be the fastest - although it doesn't handle duplicates.

Correction: Sorry, I missed one, @Alex Spangher's solution using the list index method is by far the fastest.

Update: Added @EliadL's answer.

Hope this helps.

Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.

2022-02-18 Update

Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).

Plus: Added two more methods utilizing dictionaries.
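One caveat with the dictionary methods: when the series contains duplicate values, both `dict(zip(...))` and the dict comprehension keep the label of the last occurrence, not the first. A minimal sketch of a first-occurrence variant, where the helper name `first_occurrence_map` is hypothetical:

```python
import pandas as pd


def first_occurrence_map(series):
    """Map each value to the index label of its first occurrence."""
    first_label = {}
    for label, value in series.items():
        # setdefault only stores the label the first time a value is seen.
        first_label.setdefault(value, label)
    return first_label


s = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

last_wins = dict(zip(s.values, s.index))  # later duplicates overwrite earlier ones
first_wins = first_occurrence_map(s)

print(last_wins[7])   # d
print(first_wins[7])  # b
```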

Bill
  • 1
    Thanks. But shouldn't you be measuring *after* `myindex` is created, since it only needs to be created once? – EliadL Jan 01 '20 at 23:24
  • You could argue that but it depends on how many look-ups like this are required. It's only worth creating the `myindex` series if you are going to do the look-up many times. For this test I assumed it was only needed once and the total execution time was important. – Bill Jan 02 '20 at 21:57
  • 1
    Just ran into the need to do this tonight, and using .get_loc() on the same Index object across multiple lookups seems like it should be the fastest. I think an improvement to the answer would be to provide the timings for both: including the Index creation, and another timing of only the lookup after it has been created. – Rick May 14 '20 at 02:28
  • Yes, good point. @EliadL also said that. It depends in how many applications the series is static. If any values in the series change, you need to rebuild `pd.Index(myseries)`. To be fair to the other methods I assumed the original series might have changed since the last lookup. – Bill May 14 '20 at 17:06
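The two cases discussed in these comments can be timed separately. The sketch below uses the same 25-element series and compares rebuilding `pd.Index(myseries)` on every lookup with reusing one prebuilt Index. The function names are hypothetical, and the timings will vary by machine and pandas version, so no numbers are quoted here.

```python
import timeit

import pandas as pd

data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
        9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425, 300,
        212, 150, 106, 75, 53, 38]
myseries = pd.Series(data, index=range(1, 26))


def lookup_rebuilding(value):
    # Builds a fresh Index from the values on every call.
    return myseries.index[pd.Index(myseries).get_loc(value)]


# Built once; only valid while the series values stay unchanged.
value_index = pd.Index(myseries)


def lookup_reusing(value):
    return myseries.index[value_index.get_loc(value)]


t_rebuild = timeit.timeit(lambda: lookup_rebuilding(150), number=1000)
t_reuse = timeit.timeit(lambda: lookup_reusing(150), number=1000)
print(f"rebuild each time: {t_rebuild * 1e3:.1f} µs per lookup")
print(f"reuse one Index:   {t_reuse * 1e3:.1f} µs per lookup")
```

Reusing the Index only pays off if the series is static between lookups; if any value changes, `value_index` has to be rebuilt.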
15
In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()

Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
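A minimal sketch of guarding the `argmax` approach so that a missing value yields None rather than a misleading 0. The helper name `first_position_of` is hypothetical, and it works on the underlying NumPy array so that the result is always a position, not a label.

```python
import pandas as pd


def first_position_of(series, value):
    """Return the integer position of the first match, or None if absent."""
    mask = (series == value).to_numpy()
    if not mask.any():
        # Without this check, argmax on an all-False mask returns 0.
        return None
    return int(mask.argmax())


myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_position_of(myseries, 7))   # 3
print(first_position_of(myseries, 10))  # None

# The unguarded call reports position 0 even though 10 is not in the series.
print(int((myseries == 10).to_numpy().argmax()))  # 0
```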
Alon
  • The point about knowing 7 is an element in advance is right on. However using an `any` check is not ideal since a double iteration is needed. There's a cool post op check that will unveil all `False` conditions you can see [here](https://stackoverflow.com/a/45846361/1330381). – jxramos Aug 23 '17 at 18:07
  • 2
    Careful, if no element matches this condition, `argmax` will still return 0 (instead of erroring out). – cs95 Jan 23 '19 at 21:29
11

Another way to do this, although equally unsatisfying, is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

On time tests using a current dataset I'm working with (consider it random):

In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
Alex Spangher
7

If you use numpy, you can get an array of the indices where your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:

(array([3], dtype=int64),)
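Note that `np.where` reports integer positions, which only coincide with the index labels when the series has a default 0..n-1 index. A minimal sketch of mapping the positions back to labels, assuming a string-labelled series chosen here for illustration:

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

# np.where returns a tuple with one array per dimension; take the first.
positions = np.where(myseries.to_numpy() == 7)[0]

# Translate integer positions into the corresponding index labels.
labels = myseries.index[positions]

print(positions.tolist())  # [1, 3]
print(labels.tolist())     # ['b', 'd']
```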
Alex
  • This is the best solution that I found. – Hadi Rohani Jun 03 '21 at 19:47
  • If using a dataframe, you can also use .values in combination with np.where / np.argwhere. To find the indices of all non-zero elements, it would be: np.argwhere(df['Column'].values) – Evan W. Apr 24 '22 at 16:13
5

You can use Series.idxmax():

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>> 
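On its own, `idxmax()` only finds the label of the largest value. Applied to a boolean mask, though, it returns the label of the first True, which is the lookup the question asks for. A minimal sketch, where the helper name `first_label_of` is hypothetical and the `any()` check covers the case where nothing matches:

```python
import pandas as pd


def first_label_of(series, value):
    """Return the index label of the first element equal to value, or None."""
    mask = series == value
    # idxmax on a boolean Series gives the label of the first True,
    # but it would return the first label if every entry were False.
    return mask.idxmax() if mask.any() else None


myseries = pd.Series([1, 4, 0, 7, 5], index=["a", "b", "c", "d", "e"])
print(first_label_of(myseries, 4))   # b
print(first_label_of(myseries, 10))  # None
print(myseries.idxmax())             # d (label of the maximum, 7)
```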
Raki Gade
  • 6
    This appears to only return the index where the max element is found, not a specific `index of certain element` like the question asked. – jxramos May 30 '17 at 19:58
4

This is the most native and scalable approach I could find:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64
EliadL
2

Another way to do it that hasn't been mentioned yet is the tolist method:

myseries.tolist().index(7)

should return the correct index, assuming the value exists in the Series.

rmutalik
  • 1
    @Alex Spangher suggested something similar on Sep 17 '14. See his answer. I have now added both versions to the test results. – Bill Jan 01 '20 at 19:59
1

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Ulf Aslak
1

pandas has a built-in Index class with a method called get_loc. This method returns one of:

an integer position (if the value occurs once)
a slice (if the value repeats in one contiguous run, as in a sorted index)
a boolean array (if the value occurs at multiple, non-contiguous positions)

Example:

>>> import pandas as pd

>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10)  # Returns index
3  # Index of 10 in series

>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10)  # Returns slice
slice(3, 6, None)  # 10 occurs at index 3 (included) to 6 (not included)


# If the matches are not contiguous, it returns an array of bools.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False,  True, False, False,  True,  True,  True, False,  True])

There are many other options too, but I found this one the simplest.
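Since `get_loc` can return three different types, code that just wants the first position has to handle each one. A minimal sketch of doing that, where the helper name `first_position_via_get_loc` is hypothetical and a missing value is mapped to None:

```python
import numpy as np
import pandas as pd


def first_position_via_get_loc(series, value):
    """Return the first integer position of value in series, or None."""
    try:
        loc = pd.Index(series).get_loc(value)
    except KeyError:
        return None  # value is not present
    if isinstance(loc, slice):
        # Contiguous run of duplicates: the slice start is the first match.
        return loc.start or 0
    if isinstance(loc, np.ndarray):
        # Boolean mask: position of the first True entry.
        return int(np.flatnonzero(loc)[0])
    return int(loc)  # single match: already an integer position


print(first_position_via_get_loc(pd.Series([1, 3, 8, 10, 13]), 10))          # 3
print(first_position_via_get_loc(pd.Series([1, 3, 8, 10, 10, 10, 13]), 10))  # 3
print(first_position_via_get_loc(pd.Series([1, 10, 3, 8, 10, 13, 10]), 10))  # 1
print(first_position_via_get_loc(pd.Series([1, 3, 8]), 10))                  # None
```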

0

The df.index attribute will help you find the exact row number:

my_fl2=(df['ConvertedCompYearly'] == 45241312 )
print (df[my_fl2].index)

   
Name: ConvertedCompYearly, dtype: float64
Int64Index([66910], dtype='int64')
salim ep