pandas: given a list of values, find rows that have one of these values in a column

Question

I have a list of item ids: [1, 2, 3, 4]

I want to find the rows for those ids in another dataframe. When several rows match an id, it doesn't matter which one is picked; I just want any one of them, as fast as possible.

user_id, item_id

1, 2
1, 3
1, 2
4, 4
5, 4
2, 3
3, 1
3, 2

Output (one possible result):

user_id, item_id
3, 1
3, 2
2, 3
4, 4

Currently, I am using `item_ids.to_frame().merge(df, on='item_id', how='inner').drop_duplicates(subset=['item_id'])`. Is there an obviously better way?

Possible duplicate of [How to implement 'in' and 'not in' for Pandas dataframe](https://stackoverflow.com/questions/19960077/how-to-implement-in-and-not-in-for-pandas-dataframe) — Terry, Apr 06 '19 at 12:56

jezrael · Accepted Answer · 2019-04-06T13:41:04.303

First filter with Series.isin, then remove duplicates with DataFrame.drop_duplicates, and finally sort if necessary:

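L = [1, 2, 3, 4]

# keep only the rows whose item_id is in the list
df = df[df['item_id'].isin(L)]
# keep one row per item_id (the last occurrence), then sort by item_id
df = df.drop_duplicates('item_id', keep='last').sort_values('item_id')
print(df)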
   user_id  item_id
6        3        1
7        3        2
5        2        3
4        5        4
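
For reference, here is a self-contained sketch of the same steps, run on the sample data from the question. The names `mask` and `result` are only illustrative; the snippet above reassigns `df` instead.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [1, 1, 1, 4, 5, 2, 3, 3],
    'item_id': [2, 3, 2, 4, 4, 3, 1, 2],
})
item_ids = [1, 2, 3, 4]

# Boolean mask: True where item_id is one of the wanted ids.
mask = df['item_id'].isin(item_ids)

# Keep one row per item_id (here the last occurrence), sorted by item_id.
result = (
    df[mask]
    .drop_duplicates('item_id', keep='last')
    .sort_values('item_id')
)
print(result)
```

Since any matching row is acceptable, `keep='first'` (the default) would work just as well; it simply picks a different row for ids that occur more than once.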

Performance of isin vs. query on 10M rows:

import numpy as np
import pandas as pd

np.random.seed(2019)

item_ids = [1, 2, 3, 4]

N = 10 ** 7
#10% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids + [5], p=(.025,.025,.025,.025,.9),size=N)})

In [296]: %timeit df.query('item_id in {}'.format(item_ids))
284 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [297]: %timeit df[df['item_id'].isin(item_ids)]
174 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#50% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids+ [5], p=(.125,.125,.125,.125,.5),size=N)})

In [299]: %timeit df.query('item_id in {}'.format(item_ids))
404 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [300]: %timeit df[df['item_id'].isin(item_ids)]
299 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#90% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids+ [5], p=(.225,.225,.225,.225,.1),size=N)})

In [302]: %timeit df.query('item_id in {}'.format(item_ids))
480 ms ± 5.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [303]: %timeit df[df['item_id'].isin(item_ids)]
372 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
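
The merge-based approach from the question can be timed in the same spirit. Below is a rough sketch using the standard-library `timeit` module instead of the IPython `%timeit` magic. The helper names `with_isin` and `with_merge`, the random `user_id` column, and the smaller frame size are assumptions made for illustration, so the numbers it prints are not comparable to the timings above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)
item_ids = [1, 2, 3, 4]

# Much smaller than the 10M-row frame above so the sketch runs quickly.
N = 10 ** 5
df = pd.DataFrame({
    'user_id': rng.integers(0, 1000, size=N),
    'item_id': rng.choice(item_ids + [5], p=(.125, .125, .125, .125, .5), size=N),
})


def with_isin(df, item_ids):
    # Filter with a boolean mask, then keep the first row per item_id.
    return df[df['item_id'].isin(item_ids)].drop_duplicates('item_id')


def with_merge(df, item_ids):
    # The approach from the question: inner merge, then drop duplicates.
    wanted = pd.Series(item_ids, name='item_id').to_frame()
    merged = wanted.merge(df, on='item_id', how='inner')
    return merged.drop_duplicates(subset=['item_id'])


for fn in (with_isin, with_merge):
    seconds = timeit.timeit(lambda: fn(df, item_ids), number=5) / 5
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```

Both helpers return one row per requested id; which one is faster depends on the data, so it is worth running this on a frame shaped like the real one.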

How about `item_ids.to_frame().merge(df, on='item_id', how='inner').drop_duplicates(subset=['item_id'])`? Is there any difference? — eugene, Apr 06 '19 at 13:00
@eugene - I'm not sure I can find it, but I remember a performance test where `isin` was faster than `merge`. The best test is always on real data, though. — jezrael, Apr 06 '19 at 13:04

score 1 · Answer 2 · answered Apr 06 '19 at 13:14

DataFrame.query can be faster, as it relies on the Numexpr package and supports fast vectorized operations:

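import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 4, 5, 2, 3, 3], 'item_id': [2, 3, 2, 4, 4, 3, 1, 2]})
item_ids = [1, 2, 3, 4]

# filter with query, then keep the first row found for each item_id
df.query('item_id in {}'.format(item_ids)).drop_duplicates('item_id', keep='first')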

Output:

    item_id user_id
0    2       1
1    3       1
3    4       4
6    1       3
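
As a variation on the query call above, the list can also be referenced from inside the expression with the `@` prefix rather than being formatted into the string. This is a minimal sketch on the same sample data; the name `result` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 4, 5, 2, 3, 3],
                   'item_id': [2, 3, 2, 4, 4, 3, 1, 2]})
item_ids = [1, 2, 3, 4]

# '@item_ids' refers to the Python variable named item_ids in the calling scope.
result = df.query('item_id in @item_ids').drop_duplicates('item_id', keep='first')
print(result)
```

This selects the same rows as the string-formatted version and avoids building the expression by hand.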
