406

I have two pandas data frames that have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Expected result:

   col1  col2
3     4    13
4     5    14
think nice things
  • @TedPetrou I fail to see how the answer you provided is the correct one. If I have two dataframes of which one is a subset of the other, I need to remove all those rows which are in the subset. I don't want to remove duplicates; I completely want to remove the subset. – jukebox May 16 '19 at 07:38
  • Possible duplicate of [dropping rows from dataframe based on a "not in" condition](https://stackoverflow.com/questions/27965295/dropping-rows-from-dataframe-based-on-a-not-in-condition) – Jim G. Sep 11 '19 at 18:33

17 Answers

384

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame by adding the row [3, 10]: each of its values appears somewhere in df2, but the pair never occurs there as a row, which is exactly the case that trips up the other solutions.

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly one row of df2. Use the indicator parameter to return an extra column indicating which table the row came from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
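
To go from the mask to the rows themselves, filter df_all with it and drop the helper column (gies0r suggests the same in the comments below):

df_all[df_all['_merge'] == 'left_only'].drop(columns='_merge')

   col1  col2
3     4    13
4     5    14
5     3    10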

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not that both values occur together in the same row. Adding the last row, which is unique as a pair but takes each of its column values from df2, exposes the mistake:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('l')).all(1)
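
On the modified df1 this marks row 5 ([3, 10]) as present, even though that exact row never occurs in df2, because each of its values appears somewhere in the corresponding column:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool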
Ted Petrou
  • but, I suppose, they were assuming that col1 is unique, being an index (not mentioned in the question, but obvious). So, if there is never a case where there are two values of col2 for the same value of col1 (there can't be two col1=3 rows), the answers above are correct. – pashute Nov 06 '17 at 08:38
  • It's certainly not obvious, so your point is invalid. My solution generalizes to more cases. – Ted Petrou Nov 06 '17 at 13:54
  • Question: wouldn't it be easier to create a slice rather than a boolean array? Since the objective is to get the rows. – Matías Romo Feb 20 '19 at 02:50
  • Use `df_all[df_all['_merge'] == 'left_only']` to have a df with the results – gies0r May 15 '19 at 09:38
  • For the newly arrived, the addition of the extra row without explanation is confusing. Then @gies0r makes this solution better. Furthermore I'd suggest using `how='outer'` so that the `_merge` column has left/right/both, which is more comprehensible when future readers try to apply the solution to their problems. – yeliabsalohcin Sep 09 '21 at 14:46
  • Is it possible to get a count of "left_only"? – x89 Sep 15 '21 at 09:30
  • @TedPetrou Why is `.drop_duplicates()` needed? I don't see that the DF had any DUP rows in it – Rahav Nov 16 '22 at 19:00
259

One method would be to store the result of an inner merge from both dfs, then we can simply select the rows whose values are not in this common set:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However, this relies on the rows lining up: when a DataFrame is passed to isin, values are compared by index and column label. If df2's rows do not align with df1's, this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
EdChum
  • `df1[~df1.isin(df2)].dropna(how='all')` seems to do the trick. Thanks anyway - your answer helped me to find a solution. – think nice things Mar 06 '15 at 15:48
  • Would you care to explain what `~` does in your code `df1[~df1.isin(df2)]`, please? Can't google anything out of it since it's just a symbol. Thanks. – Bowen Liu Oct 29 '18 at 16:03
  • @BowenLiu it negates the expression; basically it says select all that are NOT IN instead of IN. – Vega Aug 24 '20 at 11:25
  • @thinknicethings, it could be simpler: `df1[~df1.index.isin(df2.index)]` – Gill Bates Jun 05 '21 at 09:13
121

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]
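
On the question's data (df1 indexed 0-4, df2 indexed 0-2) this returns the expected rows:

   col1  col2
3     4    13
4     5    14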
Dennis Golomazov
17

As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask needed to filter out the rows present in df2 is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If the index should be taken into account, set_index has a keyword argument append that adds columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.
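
For illustration, a concrete version of that alternative on the same df1 and df2 (to_records(index=False) yields one tuple per row):

idx1 = pandas.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
idx2 = pandas.MultiIndex.from_tuples(df2.to_records(index=False).tolist())
df1.loc[~idx1.isin(idx2)]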

Rune Lyngsoe
12

Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps.

Step 1. Add columns key1 and key2 to df_1 and df_2 respectively.

Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step 3. Select only those rows from df_1 where key1 is not equal to key2.

Step 4. Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

df_1['key1'] = 1
df_2['key2'] = 1
# left join: key2 comes through as NaN for rows of df_1 with no match in df_2
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# NaN != 1, so this keeps exactly the unmatched rows
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
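
To make this concrete, here is a minimal run on the question's data (an illustration that substitutes the question's column names col1 and col2 for field_x and field_y):

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df_2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['col1', 'col2'], how='left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1', 'key2'], axis=1)
df_1

   col1  col2
3     4    13
4     5    14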
Jon Surrell
  • I don't think this is technically what he wants - he wants to know which rows were unique to which df. but, I think this solution returns a df of rows that were either unique to the first df or the second df. – MetaStack Aug 30 '16 at 20:37
  • Why do you need key1 and key2=1?? You could use field_x and field_y as well – ranemak Sep 27 '22 at 08:14
10

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), 
                   how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop_duplicates is used to minimize the comparisons; it would work without it as well. The best way is to compare the row contents themselves rather than the index or one or two columns, and the same code can be used for other filters like 'both' and 'right_only' to achieve similar results. For this syntax the dataframes can have any number of columns and even different indices; only the columns need to occur in both dataframes.

Why this is the best way?

  1. index.difference only works for unique index based comparisons
  2. pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.
Hamza
9

I think the answers involving merging are extremely slow. Therefore I would suggest another way of getting the rows which differ between the two dataframes:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows, where all columns are equal do not use this approach.

Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:

ids_in_df2 = df2.col1.unique()
not_found_ids = df1[~df1['col1'].isin(ids_in_df2)]

And that's it. You get a dataframe containing only those rows whose col1 does not appear in df2.

lschmidt90
8

A bit late, but it might be worth checking the "indicator" parameter of pd.merge.

See this other question for an example: Compare PandaS DataFrames and return rows that are missing from the first one
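
In short, a minimal sketch on the question's data (drop(columns=...) simply removes the helper column afterwards):

merged = df1.merge(df2, on=['col1', 'col2'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')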

jabellcu
  • Yes! Also here: https://stackoverflow.com/questions/49487263/pandas-left-join-where-right-is-null-on-multiple-columns?noredirect=1&lq=1 – Dan Apr 03 '19 at 07:00
5

You can also concat df1, df2:

x = pd.concat([df1, df2])

and then remove all duplicates:

y = x.drop_duplicates(keep=False, inplace=False)
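
On the question's data this yields the expected result: every row of df2 also occurs in df1, so each common row appears twice in x and keep=False drops both copies:

y
   col1  col2
3     4    13
4     5    14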
Semeon Balagula
4

I have an easier way in 2 simple steps. As the OP mentioned, suppose dataframe2 is a subset of dataframe1 and the columns in the 2 dataframes are the same:

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

### Step 1: just append the 2nd df at the end of the 1st df 
df_both = df1.append(df2)

### Step 2: drop all rows that occur in both dataframes (keep=False drops every duplicate)
df_dif = df_both.drop_duplicates(keep=False)

## mission accomplished!
df_dif
Out[20]: 
   col1  col2
3     4    13
4     5    14
5     3    10
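
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions, step 1 becomes the equivalent concat call:

df_both = pd.concat([df1, df2])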
neutralname
3

You can do it using the isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

Explanation:

In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool
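
Caveat: as the top answer demonstrates, this checks each column's membership independently, so a row such as [3, 10], whose individual values both occur somewhere in df2, would be wrongly treated as present.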
MaxU - stand with Ukraine
3

Here is another way of solving this:

df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Or:

df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
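
One caveat: merge returns a result with a fresh RangeIndex rather than df1's original labels, so these expressions only work when df1 itself has a default RangeIndex and the common rows sit at matching positions. A sketch of a more robust variant, using reset_index to carry df1's labels through the merge:

common_idx = df1.reset_index().merge(df2, how='inner', on=['col1', 'col2'])['index']
df1.loc[df1.index.difference(common_idx)]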
Sergey Zakharov
3

Extract the dissimilar rows using the merge function:

df = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
               how='left', indicator=True)

Save the dissimilar rows to CSV:

df[df['_merge'] == 'left_only'].to_csv('output.csv')
ljmc
1

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry:

df2['Empt'] = 1
df1 = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
df1['Empt'].fillna(0, inplace=True)

This makes it so every entry in df1 has a code: 0 if it is unique to df1, 1 if it is in both dataframes. You then use this to restrict to what you want:

answer = df1[df1['Empt'] == 0]
r.rz
1

How about this:

import numpy as np
import pandas

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                               'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                               'col2' : [10, 11, 12]})
# set of row-tuples from df2 for O(1) membership tests
records_df2 = set(tuple(row) for row in df2.values)
# mark each row of df1 that also occurs in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
adamwlev
1

Easier, simpler and elegant:

import numpy as np

# note: this compares index labels only, not row contents
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices, :]
MNK
-1

pd.concat([df1, df2]).drop_duplicates(keep=False) will concatenate the two DataFrames together, and then drop all the duplicates, keeping only the unique rows. By default it will keep the first occurrence of the duplicate, but setting keep=False will drop all the duplicates.

Keep in mind that if you need to compare DataFrames whose columns have different names, you will have to make sure the columns share the same names before concatenating the dataframes.

Also, if the dataframes have a different column order, that will affect the final result.
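
If the column order differs but the names match, one way to align df2 to df1 before concatenating is:

df2 = df2[df1.columns]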

chubercik