First filter by Series.isin
, then remove duplicates by DataFrame.drop_duplicates
and last if necessary sorting:
L = [1, 2, 3, 4]
df = df[df['item_id'].isin(L)]
df = df.drop_duplicates('item_id', keep='last').sort_values('item_id')
print (df)
user_id item_id
6 3 1
7 3 2
5 2 3
4 5 4
Performance - isin
vs query
function in 10M
rows:
np.random.seed(2019)
item_ids = [1, 2, 3, 4]
N = 10 ** 7
#1% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids + [5], p=(.025,.025,.025,.025,.9),size=N)})
In [296]: %timeit df.query('item_id in {}'.format(item_ids))
284 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [297]: %timeit df[df['item_id'].isin(item_ids)]
174 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#50% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids+ [5], p=(.125,.125,.125,.125,.5),size=N)})
In [299]: %timeit df.query('item_id in {}'.format(item_ids))
404 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [300]: %timeit df[df['item_id'].isin(item_ids)]
299 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#90% matched values
df = pd.DataFrame({'item_id':np.random.choice(item_ids+ [5], p=(.225,.225,.225,.225,.1),size=N)})
In [302]: %timeit df.query('item_id in {}'.format(item_ids))
480 ms ± 5.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [303]: %timeit df[df['item_id'].isin(item_ids)]
372 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)