
I want to filter df1 based on a column of df2: I only need to keep the rows of df1 that appear in df2. I tried using isin() like so:

    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    df1[df1['A'].isin(df2['B'])]

Which gives the desired df:

       A
    1  6
    2  3
    3  6
    4  3

However, my dataframes are very large (millions of rows) so this operation takes a significant amount of time. Are there other, more efficient ways to get the desired result?

apples-oranges
  • Maybe this answer will help you (https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas): it has an example with **isin** but also one without ;) – Damian Jan 27 '19 at 01:23
  • Are *both* your dataframes very large? Also, are your dataframes *equally* large or is one much larger than the other? It would be helpful if you created some sample input to benchmark against, e.g. `pd.DataFrame({'A': np.random.randint(0, 10, 10**6)})`. – jpp Jan 27 '19 at 01:31
  • You might want to consider sorting the rows in `df2`, or better yet extracting the rows of `df2` into a numpy array and removing any duplicates, then taking the intersection as you're doing above. I would avoid using `isin`, as afaik it doesn't take advantage of sorted or unique arrays. (A sketch of this appears below the comments.) – duncster94 Jan 27 '19 at 01:31
  • @jpp Both are quite large I am afraid. `df1` is about 6 million rows, `df2` about 2 million. – apples-oranges Jan 27 '19 at 01:34
  • @duncster94 Interesting, I will try that. – apples-oranges Jan 27 '19 at 01:36
  • @apples-oranges Not sure why it'd take so long.. Do you have a data frame of integers, as in your example above, or do you have strings (or some other non-numeric object)? – rafaelc Jan 27 '19 at 02:25
  • @RafaelC Yes, you are correct, it's strings. Is there something I can try for strings? – apples-oranges Jan 27 '19 at 02:28
  • Yes. Use `sklearn.preprocessing.LabelEncoder` or any similar tool to map the strings to unique ids, and use `isin` on these integer-dtyped columns. Make sure the `dtypes` are integers, not `object`. (A sketch of this also appears below the comments.) – rafaelc Jan 27 '19 at 02:29
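
A minimal sketch of duncster94's suggestion above, reusing the question's toy frames. The `np.unique` deduplication and the `np.isin` membership test are one reading of that comment, not the commenter's exact recipe:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    # Deduplicate the lookup values once, then test membership on the raw arrays.
    keys = np.unique(df2['B'].to_numpy())
    result = df1[np.isin(df1['A'].to_numpy(), keys)]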
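
And a sketch along the lines of rafaelc's encoding idea, with `pd.factorize` standing in for `sklearn.preprocessing.LabelEncoder`; factorizing both columns in one pass so they share a vocabulary is an assumption about how the mapping would be applied:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'A': ['x', 'y', 'z', 'y']})
    df2 = pd.DataFrame({'B': ['y', 'q', 'y']})

    # Factorize both columns against one shared vocabulary so the same
    # string always maps to the same integer code.
    codes, _ = pd.factorize(pd.concat([df1['A'], df2['B']], ignore_index=True))
    a_codes, b_codes = codes[:len(df1)], codes[len(df1):]
    result = df1[np.isin(a_codes, b_codes)]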

1 Answer


What if you try a left join and then filter out the NAs? I just generated two fairly large data frames (10 million rows and 4 million rows), and on an average laptop with 8 GB of RAM it ran in seconds. The example is below. Hope it helps.

    import pandas as pd

    df1 = pd.DataFrame({'A': range(10000000), 'B': range(0, 20000000, 2)})
    df2 = pd.DataFrame({'C': range(4000000), 'D': range(0, 8000000, 2)})
    # Left-join df2 onto df1, then keep only the rows that found a match in "C".
    df = pd.merge(df1, df2, how="left", left_on="B", right_on="C")
    df = df[df["C"].notnull()].copy()
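
Note that the `notnull` filter effectively turns the left join into an inner join; passing `how="inner"` to `pd.merge` would yield just the matched rows in a single step.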
griggy
  • Thanks! However, in the resulting `df` I found some records in `df1` that were not in `df2`. – apples-oranges Jan 27 '19 at 02:31
  • Sorry, actually since your filter is based on the values in df2 (in my example), you could try doing the opposite join: `df = pd.merge(df2, df1, how="left", left_on="D", right_on="B")`, using the appropriate columns in the merge. I think this one will work without problems (see the sketch below). Cheers. – griggy Jan 27 '19 at 03:36
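
Translating that fix back to the question's column names, a hedged sketch: deduplicating `df2`'s key column first guarantees the join cannot multiply rows of `df1`, and an inner merge removes the need for the `notnull` filter. The `drop_duplicates` step is an addition here, not part of the answer as posted:

    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    # Drop duplicate keys so each row of df1 matches at most once, then
    # inner-join: only rows of df1 whose 'A' appears in df2's 'B' survive.
    keys = df2[['B']].drop_duplicates()
    result = df1.merge(keys, left_on='A', right_on='B', how='inner').drop(columns='B')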