406

I have two pandas data frames that have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Expected result:

   col1  col2
3     4    13
4     5    14
think nice things
  • @TedPetrou I fail to see how the answer you provided is the correct one. If I have two dataframes of which one is a subset of the other, I need to remove all those rows which are in the subset. I don't want to remove duplicates; I completely want to remove the subset. – jukebox May 16 '19 at 07:38
  • Possible duplicate of [dropping rows from dataframe based on a "not in" condition](https://stackoverflow.com/questions/27965295/dropping-rows-from-dataframe-based-on-a-not-in-condition) – Jim G. Sep 11 '19 at 18:33

17 Answers

384

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame by adding the row [3, 10]: each of its values appears somewhere in df2, but the pair never occurs there as a row, which is exactly the case that trips up the other solutions.

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly one row of df2. Use the indicator parameter to return an extra column indicating which table the row came from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
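
To go from the mask to the rows themselves, filter df_all with it and drop the helper column (gies0r suggests the same in the comments below):

df_all[df_all['_merge'] == 'left_only'].drop(columns='_merge')

   col1  col2
3     4    13
4     5    14
5     3    10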

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not that both values occur together in the same row. Adding the last row, which is unique as a pair but takes each of its column values from df2, exposes the mistake:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('l')).all(1)
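
On the modified df1 this marks row 5 ([3, 10]) as present, even though that exact row never occurs in df2, because each of its values appears somewhere in the corresponding column:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool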
Ted Petrou
  • but, I suppose, they were assuming that col1 is unique, being an index (not mentioned in the question, but obvious). So, if there is never a case where there are two values of col2 for the same value of col1 (there can't be two col1=3 rows), the answers above are correct. – pashute Nov 06 '17 at 08:38
  • It's certainly not obvious, so your point is invalid. My solution generalizes to more cases. – Ted Petrou Nov 06 '17 at 13:54
  • Question: wouldn't it be easier to create a slice rather than a boolean array? Since the objective is to get the rows. – Matías Romo Feb 20 '19 at 02:50
  • Use `df_all[df_all['_merge'] == 'left_only']` to have a df with the results – gies0r May 15 '19 at 09:38
  • For the newly arrived, the addition of the extra row without explanation is confusing. Then @gies0r makes this solution better. Furthermore I'd suggest using `how='outer'` so that the `_merge` column has left/right/both, which is more comprehensible when future readers try to apply the solution to their problems. – yeliabsalohcin Sep 09 '21 at 14:46
  • Is it possible to get a count of "left_only"? – x89 Sep 15 '21 at 09:30
  • @TedPetrou Why is `.drop_duplicates()` needed? I don't see that the DF had any DUP rows in it – Rahav Nov 16 '22 at 19:00
259

One method would be to store the result of an inner merge from both dfs, then we can simply select the rows whose values are not in this common set:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However, this relies on the rows lining up: when a DataFrame is passed to isin, values are compared by index and column label. If df2's rows do not align with df1's, this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
EdChum
  • `df1[~df1.isin(df2)].dropna(how='all')` seems to do the trick. Thanks anyway - your answer helped me to find a solution. – think nice things Mar 06 '15 at 15:48
  • Would you care to explain what `~` does in your code `df1[~df1.isin(df2)]`, please? Can't google anything out of it since it's just a symbol. Thanks. – Bowen Liu Oct 29 '18 at 16:03
  • @BowenLiu it negates the expression; basically it says select all that are NOT IN instead of IN. – Vega Aug 24 '20 at 11:25
  • @thinknicethings, it could be simpler: `df1[~df1.index.isin(df2.index)]` – Gill Bates Jun 05 '21 at 09:13
121

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]
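
On the question's data (df1 indexed 0-4, df2 indexed 0-2) this returns the expected rows:

   col1  col2
3     4    13
4     5    14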
Dennis Golomazov
17

As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask needed to filter out the rows present in df2 is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If the index should be taken into account, set_index has a keyword argument append that adds columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.
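
For illustration, a concrete version of that alternative on the same df1 and df2 (to_records(index=False) yields one tuple per row):

idx1 = pandas.MultiIndex.from_tuples(df1.to_records(index=False).tolist())
idx2 = pandas.MultiIndex.from_tuples(df2.to_records(index=False).tolist())
df1.loc[~idx1.isin(idx2)]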

Rune Lyngsoe
12

Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps.

Step 1. Add columns key1 and key2 to df_1 and df_2 respectively.

Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step 3. Select only those rows from df_1 where key1 is not equal to key2.

Step 4. Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

df_1['key1'] = 1
df_2['key2'] = 1
# left join: key2 comes through as NaN for rows of df_1 with no match in df_2
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
# NaN != 1, so this keeps exactly the unmatched rows
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
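
To make this concrete, here is a minimal run on the question's data (an illustration that substitutes the question's column names col1 and col2 for field_x and field_y):

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df_2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['col1', 'col2'], how='left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1', 'key2'], axis=1)
df_1

   col1  col2
3     4    13
4     5    14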
Jon Surrell
  • I don't think this is technically what he wants - he wants to know which rows were unique to which df. but, I think this solution returns a df of rows that were either unique to the first df or the second df. – MetaStack Aug 30 '16 at 20:37
  • Why do you need key1 and key2=1?? You could use field_x and field_y as well – ranemak Sep 27 '22 at 08:14
10

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), 
                   how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop_duplicates is used to minimize the comparisons; it would work without it as well. The best way is to compare the row contents themselves rather than the index or one or two columns, and the same code can be used for other filters like 'both' and 'right_only' to achieve similar results. For this syntax the dataframes can have any number of columns and even different indices; only the columns need to occur in both dataframes.

Why this is the best way?

  1. index.difference only works for unique index based comparisons
  2. pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.
Hamza
9

I think the answers involving merging are extremely slow. Therefore I would suggest another way of getting the rows which differ between the two dataframes:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows, where all columns are equal do not use this approach.

Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:

ids_in_df2 = df2.col1.unique()
not_found_ids = df1[~df1['col1'].isin(ids_in_df2)]

And that's it. You get a dataframe containing only those rows whose col1 does not appear in df2.

lschmidt90
8

A bit late, but it might be worth checking the "indicator" parameter of pd.merge.

See this other question for an example: Compare PandaS DataFrames and return rows that are missing from the first one
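
In short, a minimal sketch on the question's data (drop(columns=...) simply removes the helper column afterwards):

merged = df1.merge(df2, on=['col1', 'col2'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')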

jabellcu
  • Yes! Also here: https://stackoverflow.com/questions/49487263/pandas-left-join-where-right-is-null-on-multiple-columns?noredirect=1&lq=1 – Dan Apr 03 '19 at 07:00
5

You can also concat df1, df2:

x = pd.concat([df1, df2])

and then remove all duplicates:

y = x.drop_duplicates(keep=False, inplace=False)
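
On the question's data this yields the expected result: every row of df2 also occurs in df1, so each common row appears twice in x and keep=False drops both copies:

y
   col1  col2
3     4    13
4     5    14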
Semeon Balagula
4

I have an easier way in 2 simple steps. As the OP mentioned, suppose dataframe2 is a subset of dataframe1 and the columns in the 2 dataframes are the same:

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

### Step 1: just append the 2nd df at the end of the 1st df 
df_both = df1.append(df2)

### Step 2: drop all rows that occur in both dataframes (keep=False drops every duplicate)
df_dif = df_both.drop_duplicates(keep=False)

## mission accomplished!
df_dif
Out[20]: 
   col1  col2
3     4    13
4     5    14
5     3    10
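
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions, step 1 becomes the equivalent concat call:

df_both = pd.concat([df1, df2])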
neutralname
3

You can do it using the isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

Explanation:

In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool
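
Caveat: as the top answer demonstrates, this checks each column's membership independently, so a row such as [3, 10], whose individual values both occur somewhere in df2, would be wrongly treated as present.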
MaxU - stand with Ukraine
3

Here is another way of solving this:

df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Or:

df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
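
One caveat: merge returns a result with a fresh RangeIndex rather than df1's original labels, so these expressions only work when df1 itself has a default RangeIndex and the common rows sit at matching positions. A sketch of a more robust variant, using reset_index to carry df1's labels through the merge:

common_idx = df1.reset_index().merge(df2, how='inner', on=['col1', 'col2'])['index']
df1.loc[df1.index.difference(common_idx)]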
Sergey Zakharov
3

Extract the dissimilar rows using the merge function:

df = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
               how='left', indicator=True)

Save the dissimilar rows to CSV:

df[df['_merge'] == 'left_only'].to_csv('output.csv')
ljmc
1

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry:

df2['Empt'] = 1
df1 = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
df1['Empt'].fillna(0, inplace=True)

This makes it so every entry in df1 has a code: 0 if it is unique to df1, 1 if it is in both dataframes. You then use this to restrict to what you want:

answer = df1[df1['Empt'] == 0]
r.rz
1

How about this:

import numpy as np
import pandas

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                               'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                               'col2' : [10, 11, 12]})
# set of row-tuples from df2 for O(1) membership tests
records_df2 = set(tuple(row) for row in df2.values)
# mark each row of df1 that also occurs in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
adamwlev
1

Easier, simpler and elegant:

import numpy as np

# note: this compares index labels only, not row contents
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices, :]
MNK
-1

pd.concat([df1, df2]).drop_duplicates(keep=False) will concatenate the two DataFrames together, and then drop all the duplicates, keeping only the unique rows. By default it will keep the first occurrence of the duplicate, but setting keep=False will drop all the duplicates.

Keep in mind that if you need to compare DataFrames whose columns have different names, you will have to make sure the columns share the same names before concatenating the dataframes.

Also, if the dataframes have a different column order, that will affect the final result.
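
If the column order differs but the names match, one way to align df2 to df1 before concatenating is:

df2 = df2[df1.columns]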

chubercik