12

I have a pandas frame similar to this one:

import pandas as pd
import numpy as np

data = {'Col1' : [4,5,6,7], 'Col2' : [10,20,30,40], 'Col3' : [100,50,-30,-50], 'Col4' : ['AAA', 'BBB', 'AAA', 'CCC']}

df = pd.DataFrame(data=data, index = ['R1','R2','R3','R4'])

    Col1  Col2  Col3 Col4
R1     4    10   100  AAA
R2     5    20    50  BBB
R3     6    30   -30  AAA
R4     7    40   -50  CCC

Given an array of targets:

target_array = np.array(['AAA', 'CCC', 'EEE'])

I would like to find the cell elements indices in Col4 which also appear in the target_array.

I have tried to find a documented answer but it seems beyond my skill... Anyone has any advice?

P.S. Incidentally, for this particular case I can input a target array whose elements are the data frame indices names array(['R1', 'R3', 'R5']). Would it be easier that way?

Edit 1:

Thank you very much for all the great replies. Sadly I can only choose one but everyone seems to point @Divakar as the best. Still you should look at piRSquared and MaxU speed comparisons for all the possibilities available

Delosari
  • 677
  • 2
  • 17
  • 29

5 Answers5

14

You can use NumPy's in1d -

df.index[np.in1d(df['Col4'],target_array)]

Explanation

1) Create a 1D mask corresponding to each row telling us whether there is a match between col4's element and any element in target_array :

mask = np.in1d(df['Col4'],target_array)

2) Use the mask to select valid indices from the dataframe as final output :

out = df.index[np.in1d(df['Col4'],target_array)]
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • @piRSquared Well I was hoping so, being a NumPy thing! ;) – Divakar Jun 28 '16 at 18:24
  • @piRSquared That's what I have figured generally between NumPy and pandas, when talking about built-ins that do similar operations. But with groupby operations, I have seen pandas having an upper hand. – Divakar Jun 28 '16 at 18:29
  • Thank you very much for the solution and the explanation: It is nice to understand each step :) – Delosari Jun 28 '16 at 19:54
11

This should do it:

df.loc[df.Col4.isin(target_array)].index

EDIT:

I ran three options: from selected answers. Mine, Bruce Pucci, and Divakar

enter image description here

Divakars was faster by a large amount. I'd pick his.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
7

For the sake of completeness I've added two (.query() variants) - my timings against 400K rows df:

In [63]: df.shape
Out[63]: (400000, 4)

In [64]:  %timeit df.index[np.in1d(df['Col4'],target_array)]
10 loops, best of 3: 35.1 ms per loop

In [65]: %timeit df.index[df.Col4.isin(target_array)]
10 loops, best of 3: 36.7 ms per loop

In [66]: %timeit df.loc[df.Col4.isin(target_array)].index
10 loops, best of 3: 47.8 ms per loop

In [67]: %timeit df.query('@target_array.tolist() == Col4')
10 loops, best of 3: 45.7 ms per loop

In [68]: %timeit df.query('@target_array in Col4')
10 loops, best of 3: 51.9 ms per loop

Here is a similar comparison for (not in ...) and for different dtypes

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
5
import pandas as pd
import numpy as np

data = {'Col1' : [4,5,6,7], 'Col2' : [10,20,30,40], 'Col3' : [100,50,-30,-50], 'Col4' : ['AAA', 'BBB', 'AAA', 'CCC']}
target_array = np.array(['AAA', 'CCC', 'EEE'])

df = pd.DataFrame(data=data, index = ['R1','R2','R3','R4'])

df['in_col'] = df['Col4'].apply(lambda x: x in target_array)

Is this what you were looking for? Then you can groupby the new column and query the True elements.

Jeff
  • 2,158
  • 1
  • 16
  • 29
  • Thank you very much for reminding my of the lambda: I am rather new to python and this is a very powerful/flexible tool – Delosari Jun 28 '16 at 19:50
4
df.index[df.Col4.isin(target_array)]
Bruce Pucci
  • 1,821
  • 2
  • 19
  • 26