Finding an array elements location in a pandas frame column (a.k.a pd.series)

Question

I have a pandas frame similar to this one:

import pandas as pd
import numpy as np

data = {'Col1' : [4,5,6,7], 'Col2' : [10,20,30,40], 'Col3' : [100,50,-30,-50], 'Col4' : ['AAA', 'BBB', 'AAA', 'CCC']}

df = pd.DataFrame(data=data, index = ['R1','R2','R3','R4'])

    Col1  Col2  Col3 Col4
R1     4    10   100  AAA
R2     5    20    50  BBB
R3     6    30   -30  AAA
R4     7    40   -50  CCC

Given an array of targets:

target_array = np.array(['AAA', 'CCC', 'EEE'])

I would like to find the cell elements indices in Col4 which also appear in the target_array.

I have tried to find a documented answer but it seems beyond my skill... Anyone has any advice?

P.S. Incidentally, for this particular case I can input a target array whose elements are the data frame indices names array(['R1', 'R3', 'R5']). Would it be easier that way?

Edit 1:

Thank you very much for all the great replies. Sadly I can only choose one but everyone seems to point @Divakar as the best. Still you should look at piRSquared and MaxU speed comparisons for all the possibilities available

Divakar · Accepted Answer · 2016-06-28T18:19:04.567

14

You can use NumPy's in1d -

df.index[np.in1d(df['Col4'],target_array)]

Explanation

1) Create a 1D mask corresponding to each row telling us whether there is a match between col4's element and any element in target_array :

mask = np.in1d(df['Col4'],target_array)

2) Use the mask to select valid indices from the dataframe as final output :

out = df.index[np.in1d(df['Col4'],target_array)]

edited Jun 28 '16 at 18:19

answered Jun 28 '16 at 18:11

Divakar

218,885
19
262
358

@piRSquared Well I was hoping so, being a NumPy thing! ;) – Divakar Jun 28 '16 at 18:24
@piRSquared That's what I have figured generally between NumPy and pandas, when talking about built-ins that do similar operations. But with groupby operations, I have seen pandas having an upper hand. – Divakar Jun 28 '16 at 18:29
Thank you very much for the solution and the explanation: It is nice to understand each step :) – Delosari Jun 28 '16 at 19:54

piRSquared · Answer 2 · 2016-06-28T18:25:50.457

11

This should do it:

df.loc[df.Col4.isin(target_array)].index

EDIT:

I ran three options: from selected answers. Mine, Bruce Pucci, and Divakar

Divakars was faster by a large amount. I'd pick his.

edited Jun 28 '16 at 18:25

answered Jun 28 '16 at 18:08

piRSquared

285,575
57
475
624

Thank you very much for the comparison, it is very neat. Just one question: Do you think the data type (str) is affecting the speed differently on each method? – Delosari Jun 28 '16 at 19:53
It changes things marginally. But the order stays the same. – piRSquared Jun 28 '16 at 19:56
That is good to know. Thank you very much for the reply again – Delosari Jun 28 '16 at 20:02

score 7 · Answer 3 · edited May 23 '17 at 10:30

For the sake of completeness I've added two (.query() variants) - my timings against 400K rows df:

In [63]: df.shape
Out[63]: (400000, 4)

In [64]:  %timeit df.index[np.in1d(df['Col4'],target_array)]
10 loops, best of 3: 35.1 ms per loop

In [65]: %timeit df.index[df.Col4.isin(target_array)]
10 loops, best of 3: 36.7 ms per loop

In [66]: %timeit df.loc[df.Col4.isin(target_array)].index
10 loops, best of 3: 47.8 ms per loop

In [67]: %timeit df.query('@target_array.tolist() == Col4')
10 loops, best of 3: 45.7 ms per loop

In [68]: %timeit df.query('@target_array in Col4')
10 loops, best of 3: 51.9 ms per loop

Here is a similar comparison for (not in ...) and for different dtypes

Thank you very much for the query options, it is a very nice discovery — Delosari, Jun 28 '16 at 19:49

score 5 · Answer 4 · answered Jun 28 '16 at 18:08

5

import pandas as pd
import numpy as np

data = {'Col1' : [4,5,6,7], 'Col2' : [10,20,30,40], 'Col3' : [100,50,-30,-50], 'Col4' : ['AAA', 'BBB', 'AAA', 'CCC']}
target_array = np.array(['AAA', 'CCC', 'EEE'])

df = pd.DataFrame(data=data, index = ['R1','R2','R3','R4'])

df['in_col'] = df['Col4'].apply(lambda x: x in target_array)

Is this what you were looking for? Then you can groupby the new column and query the True elements.

answered Jun 28 '16 at 18:08

Jeff

2,158
1
16
29

Thank you very much for reminding my of the lambda: I am rather new to python and this is a very powerful/flexible tool – Delosari Jun 28 '16 at 19:50

score 4 · Answer 5 · answered Jun 28 '16 at 18:09

4

df.index[df.Col4.isin(target_array)]

answered Jun 28 '16 at 18:09

Bruce Pucci

1,821
2
19
26

Thank you for the reply... it is a very neat approach since it uses only pandas functions – Delosari Jun 29 '16 at 15:18

Finding an array elements location in a pandas frame column (a.k.a pd.series)

5 Answers5

Linked