For each item in list L, find all of the corresponding items in a dataframe

Question

I'm looking for a fast solution to this Python problem:
- 'For each item in list L, find all of the corresponding items in a dataframe column (`df [ 'col1' ]).

The catch is that both L and df ['col1'] may contain duplicate values and all duplicates should be returned.

For example:

L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)

The desired output would be a new DataFrame where df [ 'col1' ] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)

I have found that the obvious solutions like .isin() don't work because they drop duplicates.

A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150000):

idx = [y for x in L for y in df[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)

This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).

train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]

I'd be grateful for any ideas.

score 1 · Accepted Answer · answered Jan 03 '20 at 23:04

1

Initial data:

L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })

You can create dataframe from L

df2 = pd.DataFrame({'col':L})

and merge it with initial dataframe:

result = df.merge(df2, how='inner', on='col')
print(result)

Result:

answered Jan 03 '20 at 23:04

Stepan

1,044
1
5
9

This was extremely fast -- which makes sense since `merge()` is optimized code. Should have thought of that myself. Thanks! – John Jan 03 '20 at 23:42

Grzegorz Skibinski · Answer 2 · 2020-01-03T23:08:07.970

0

IIUC try:

L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)

(Not sure how do you want to have indexes- the above will return a bit raw format)

Output:

0    1
4    1
3    4
5    4
6    4
0    1
4    1
Name: col, dtype: int64

Reindexed:

pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)

#output:

0    1
1    1
2    4
3    4
4    4
5    1
6    1
Name: col, dtype: int64

edited Jan 03 '20 at 23:08

answered Jan 03 '20 at 22:59

Grzegorz Skibinski

12,624
2
11
34

This idea makes sense but it was very slow compared to merge(). – John Jan 03 '20 at 23:45
I was thinking about it - probably it would depend - what's the size of ```df``` and what's of ```L``` – Grzegorz Skibinski Jan 03 '20 at 23:48

For each item in list L, find all of the corresponding items in a dataframe

2 Answers2