
I want to filter df1 based on a column of df2: I only need to keep the rows of df1 that appear in df2. I tried using isin() like so:

    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    df1[df1['A'].isin(df2['B'])]

Which gives the desired df:

       A
    1  6
    2  3
    3  6
    4  3

However, my dataframes are very large (millions of rows) so this operation takes a significant amount of time. Are there other, more efficient ways to get the desired result?

apples-oranges
  • Maybe this answer will help you (https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas): it has an example with **isin** but also one without ;) – Damian Jan 27 '19 at 01:23
  • Are *both* your dataframes very large? Also, are your dataframes *equally* large or is one much larger than the other? It would be helpful if you created some sample input to benchmark against, e.g. `pd.DataFrame({'A': np.random.randint(0, 10, 10**6)})`. – jpp Jan 27 '19 at 01:31
  • You might want to consider sorting the rows in `df2`, or better yet extracting the rows of `df2` into a numpy array and removing any duplicates, then taking the intersection as you're doing above. I would avoid using `isin`, as afaik it doesn't take advantage of sorted or unique arrays. (A sketch of this appears below the comments.) – duncster94 Jan 27 '19 at 01:31
  • @jpp Both are quite large I am afraid. `df1` is about 6 million rows, `df2` about 2 million. – apples-oranges Jan 27 '19 at 01:34
  • @duncster94 Interesting, I will try that. – apples-oranges Jan 27 '19 at 01:36
  • @apples-oranges Not sure why it'd take so long.. Do you have a data frame of integers, as in your example above, or do you have strings (or some other non-numeric object)? – rafaelc Jan 27 '19 at 02:25
  • @RafaelC Yes, you are correct, it's strings. Is there something I can try for strings? – apples-oranges Jan 27 '19 at 02:28
  • Yes. Use `sklearn.preprocessing.LabelEncoder` or any similar tool to map the strings to unique ids, and use `isin` on these integer-dtyped columns. Make sure the `dtypes` are integers, not `object`. (A sketch of this also appears below the comments.) – rafaelc Jan 27 '19 at 02:29
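
A minimal sketch of duncster94's suggestion above, reusing the question's toy frames. The `np.unique` deduplication and the `np.isin` membership test are one reading of that comment, not the commenter's exact recipe:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    # Deduplicate the lookup values once, then test membership on the raw arrays.
    keys = np.unique(df2['B'].to_numpy())
    result = df1[np.isin(df1['A'].to_numpy(), keys)]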
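
And a sketch along the lines of rafaelc's encoding idea, with `pd.factorize` standing in for `sklearn.preprocessing.LabelEncoder`; factorizing both columns in one pass so they share a vocabulary is an assumption about how the mapping would be applied:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'A': ['x', 'y', 'z', 'y']})
    df2 = pd.DataFrame({'B': ['y', 'q', 'y']})

    # Factorize both columns against one shared vocabulary so the same
    # string always maps to the same integer code.
    codes, _ = pd.factorize(pd.concat([df1['A'], df2['B']], ignore_index=True))
    a_codes, b_codes = codes[:len(df1)], codes[len(df1):]
    result = df1[np.isin(a_codes, b_codes)]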

1 Answer


What if you try a left join and then filter out the NAs? I just generated two fairly large data frames (10 million rows and 4 million rows), and on an average laptop with 8 GB of RAM it ran in seconds. The example is below. Hope it helps.

    import pandas as pd

    df1 = pd.DataFrame({'A': range(10000000), 'B': range(0, 20000000, 2)})
    df2 = pd.DataFrame({'C': range(4000000), 'D': range(0, 8000000, 2)})
    # Left-join df2 onto df1, then keep only the rows that found a match in "C".
    df = pd.merge(df1, df2, how="left", left_on="B", right_on="C")
    df = df[df["C"].notnull()].copy()
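
Note that the `notnull` filter effectively turns the left join into an inner join; passing `how="inner"` to `pd.merge` would yield just the matched rows in a single step.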
griggy
  • Thanks! However, in the resulting `df` I found some records in `df1` that were not in `df2`. – apples-oranges Jan 27 '19 at 02:31
  • Sorry, actually since your filter is based on the values in df2 (in my example), you could try doing the opposite join: `df = pd.merge(df2, df1, how="left", left_on="D", right_on="B")`, using the appropriate columns in the merge. I think this one will work without problems (see the sketch below). Cheers. – griggy Jan 27 '19 at 03:36
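
Translating that fix back to the question's column names, a hedged sketch: deduplicating `df2`'s key column first guarantees the join cannot multiply rows of `df1`, and an inner merge removes the need for the `notnull` filter. The `drop_duplicates` step is an addition here, not part of the answer as posted:

    import pandas as pd

    df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
    df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

    # Drop duplicate keys so each row of df1 matches at most once, then
    # inner-join: only rows of df1 whose 'A' appears in df2's 'B' survive.
    keys = df2[['B']].drop_duplicates()
    result = df1.merge(keys, left_on='A', right_on='B', how='inner').drop(columns='B')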