
I have a dataframe df1 which looks like:

   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d

and another called df2 like:

   c  l
0  A  b
1  C  a

I would like to filter df1, keeping only the rows whose (c, l) values are NOT in df2. The pairs to drop are therefore (A, b) and (C, a). So far I have tried to apply the isin method:

d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]

That seems too complicated to me, and it returns the wrong result:

   c  k  l
2  B  2  a
4  C  2  d

but I'm expecting:

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
Fabio Lamanna
  • How about concatenating the values of the two columns `c` and `l` and using this as your key? – IanS Oct 22 '15 at 13:34

7 Answers


You can do this efficiently using isin on a multiindex constructed from the desired columns:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
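To illustrate that type-agnostic point, here is a minimal sketch with a numeric key column (the data is made up for illustration; only the column names follow the question):

```python
import pandas as pd

# Same index-isin technique, but with a numeric 'c' column:
df1 = pd.DataFrame({'c': [1, 1, 2, 3, 3],
                    'l': ['a', 'b', 'a', 'a', 'd'],
                    'k': [1, 2, 2, 2, 2]})
df2 = pd.DataFrame({'c': [1, 3],
                    'l': ['b', 'a']})

keys = list(df2.columns.values)        # ['c', 'l']
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index

# Rows of df1 whose (c, l) pair does not occur in df2:
result = df1[~i1.isin(i2)]
print(result)  # keeps rows 0, 2 and 4
```

No string concatenation or type coercion is involved, so mixed dtypes in the key columns pose no problem.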


(Above answer is an edit. Following was my initial answer)

Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes and then dropping the rows where df2 is defined. Here is an example, which makes use of a temporary marker column:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined

   c  k  l  marker
0  A  1  a     NaN
1  A  2  b     1.0
2  B  2  a     NaN
3  C  2  a     1.0
4  C  2  d     NaN

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

There may be a way to do this without the temporary column, but I can't think of one. As long as your data isn't huge, the method above should be fast and sufficient.

alpha_989
jakevdp
  • Thanks for the attribution :) I think you should make your edit a new answer, hopefully the accepted one. I would definitely vote for it! – IanS Oct 22 '15 at 14:14
  • I changed the edit to be the primary answer. Thanks! – jakevdp Oct 22 '15 at 14:31
  • Nice method! I thought it was easier to do it, thanks to all for your help! – Fabio Lamanna Oct 22 '15 at 14:45
  • @jakevdp, Thanks, I tried it out and it works great for this case. I have a slightly more complicated schenario where `df2 = pd.DataFrame({'c': ['A', *], 'l': [*, 'a']})`, by `*` I mean a wildcard, so the value can be anything. So the output of `df1[~i1.isin(i2)]` should be: `pd.DataFrame({'c': ['C'], 'k': [2], 'l': ['d']})`. Is this possible to achieve by modifiying the above? – alpha_989 Jul 05 '18 at 19:30
  • Your initial answer creates a marker column, but pd.merge() now has an 'indicator' parameter. If you pass indicator=True, an extra column (called '_merge') is added, which is a marker by itself on the newly created merged df. You would then filter on joined['_merge']=='left_only'. – Sander van den Oord Nov 29 '18 at 09:13
  • This is great, exactly what I was looking for. Curious though: why can't I just compare the two dataframes directly? In my case, I have two dataframes with identical columns, but potentially unequal indices and length. I want to compare matching rows regardless of index. This solution works, but it seems like extra steps when `df1[~df1.isin(df2)]` or even `df1[~df[keys].isin(df2[keys])]` should reach the same result. (spoiler, they don't) – Matt Aug 09 '22 at 21:29
  • This was a great answer for my use case where users were 'checking off' items in a dataframe using another list. The column headers may or may not match and with this method you can define the index using different keys for the source and target dataframes. Very nice. – Alex Sep 21 '22 at 15:48

This is pretty succinct and works well:

df1 = df1[~df1.index.isin(df2.index)]
Rahul Agarwal
Haroon Hassan
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Please read [how-to-answer](http://stackoverflow.com/help/how-to-answer) for guidance on providing a quality answer. – thewaywewere Jun 10 '17 at 13:26
  • 11
    This only works because the example data is alligned, if the keys were not alligned, this would fail in every sense. Not sure how this got 14 upvotes. – Erfan Apr 05 '20 at 14:56
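To make the pitfall raised in the comment concrete, here is a small sketch reusing the question's data but giving df2 a non-overlapping index (the shifted index values are contrived):

```python
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
# Same content as the question's df2, but with a shifted index:
df2 = pd.DataFrame({'c': ['A', 'C'], 'l': ['b', 'a']}, index=[10, 11])

# The index-based filter compares index labels, not row contents,
# so with no overlapping labels it removes nothing at all:
out = df1[~df1.index.isin(df2.index)]
print(len(out))  # 5 -- every row kept, regardless of (c, l) values
```

In other words, the one-liner answers a different question ("which index labels of df1 are absent from df2") that only coincides with the intended one when the two frames happen to share aligned indices.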

Using DataFrame.merge & DataFrame.query:

A more elegant method is to do a left join with the argument indicator=True, then keep only the rows marked left_only with query:

d = (
    df1.merge(df2, 
              on=['c', 'l'],
              how='left', 
              indicator=True)
    .query('_merge == "left_only"')
    .drop(columns='_merge')
)

print(d)
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

indicator=True returns a dataframe with an extra column _merge which marks each row left_only, both, right_only:

df1.merge(df2, on=['c', 'l'], how='left', indicator=True)

   c  k  l     _merge
0  A  1  a  left_only
1  A  2  b       both
2  B  2  a  left_only
3  C  2  a       both
4  C  2  d  left_only
Erfan

I think this is quite a simple approach for filtering a dataframe based on multiple columns from another dataframe, or even based on a custom list.

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values))  # [('A', 'b'), ('C', 'a')]

# keep only the rows of df1 whose (c, l) pair is NOT present in idxs
df1 = df1[~pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
dasilvadaniel

How about:

df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
IanS
  • 1
    I think your answer would be stronger with some more information. Could you [edit] this answer to include some information about why someone should use this approach, or at least what this code accomplishes? If you can't come up with anything to elaborate on, consider a scenario: If I copy and paste your code blindly into my application, are there any edge cases I should be worried about? When should I avoid using this approach? – theB Oct 22 '15 at 20:31
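One edge case worth spelling out: bare string concatenation can collide when values share a boundary. A minimal sketch with contrived values (not from the question):

```python
import pandas as pd

# Two different (c, l) pairs that concatenate to the same key string:
df1 = pd.DataFrame({'c': ['AB'], 'k': [1], 'l': ['c']})
df2 = pd.DataFrame({'c': ['A'], 'l': ['Bc']})

key1 = df1['c'] + df1['l']   # 'ABc'
key2 = df2['c'] + df2['l']   # also 'ABc'

filtered = df1[~key1.isin(key2)]
print(len(filtered))  # 0 -- ('AB', 'c') is wrongly treated as a match
```

Inserting a separator that cannot appear in the data, e.g. `df1['c'] + '|' + df1['l']`, avoids the collision.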

Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.

gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]

For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
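A variant of the same idea (my own sketch, not part of the original answer) skips the groupby machinery and builds a plain set of tuples, which keeps the same O(1) membership test:

```python
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'], 'l': ['b', 'a']})

# Set of (c, l) pairs to exclude:
pairs = set(zip(df2['c'], df2['l']))

# Boolean list: True for rows whose pair is absent from df2
result = df1[[p not in pairs for p in zip(df1['c'], df1['l'])]]
print(result)  # keeps rows 0, 2 and 4
```

The membership check itself is identical; only the construction of the lookup structure differs, and it still drops into pure Python for the row loop.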

Randy

You can concatenate both DataFrames and drop all duplicates:

pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)

Output:

   c    k  l
0  A  1.0  a
2  B  2.0  a
4  C  2.0  d

This method doesn't work if df1 contains duplicate ['c', 'l'] pairs, because keep=False drops those rows as well.
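A small sketch with contrived data (not the question's) showing how a pair duplicated inside df1 gets dropped even though it never appears in df2:

```python
import pandas as pd

# ('B', 'a') is duplicated inside df1 and absent from df2:
df1 = pd.DataFrame({'c': ['A', 'B', 'B'],
                    'k': [1, 2, 3],
                    'l': ['b', 'a', 'a']})
df2 = pd.DataFrame({'c': ['A'], 'l': ['b']})

out = pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)
# keep=False removes every pair that occurs more than once, so the two
# ('B', 'a') rows vanish along with the ('A', 'b') match:
print(len(out))  # 0 -- but ('B', 'a') should have been kept
```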

Mykola Zotko