255

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other words, a data frame that has all the rows/columns in df1 that are not in df2?


Mark Rotteveel
userPyGeo
    The easiest way to do this will depend on how your dataframes are structured (i.e. whether the indexes can be used, etc.). This is a good example of why you should always include a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) in pandas questions. – cmaher Feb 06 '18 at 16:30
  • I have added the dataframe sample image – userPyGeo Feb 06 '18 at 16:33
  • similar to https://stackoverflow.com/q/20225110 – SpeedCoder5 Jun 07 '18 at 13:48

19 Answers

331

By using `drop_duplicates`:

pd.concat([df1,df2]).drop_duplicates(keep=False)
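As a minimal runnable sketch with made-up data (assuming df2 is a subset of df1 and neither frame contains internal duplicates):

```python
import pandas as pd

# Made-up frames: df2 is a subset of df1, neither has internal duplicates
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# Rows that appear exactly once across the concatenation form the difference
df3 = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(df3)
```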

Update:

The above method only works for those data frames that don't already have duplicates themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It will output like below, which is wrong.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct Output

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
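For reference, a self-contained sketch of Method 2 using the same frames, including dropping the helper `_merge` column afterwards:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# Left merge on all common columns; rows found only in df1 are tagged 'left_only'
merged = df1.merge(df2, indicator=True, how='left')
df3 = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df3)
```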
bryant1410
BENY
    You could also determine which columns are to be considered, when looking for duplicates: `pd.concat([df1,df2]).drop_duplicates(subset = ['col1','col2'], keep=False)` – Szpaqn Jan 08 '19 at 08:15
  • @Szpaqn notice this method will not handle the special case. :-) – BENY Jan 09 '19 at 17:10
  • Note that this may cause unexpected rows to remain in the result if one of your data types is `float` (because `12.00000000001 != 12`). A better practice is to find the set intersection of the IDs in two data frames and get the difference based on that. – Jiageng Sep 03 '19 at 01:49
  • This could be outdated. MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False – DtechNet Oct 16 '19 at 19:21
  • @DtechNet you need to make the two data frames have the same column names – BENY Oct 16 '19 at 20:14
  • I think Method 2 can be made more efficient by replacing the lambda function with a slicer: ```df = df1.merge(df2,indicator = True, how='left') df = df.loc[df['_merge']!='both']``` – hamx0r Jan 24 '20 at 12:54
  • Method 2 (`indicator=True`) is a very versatile and useful tool, I'd love to see it at the top of this answer, but with 'outer' not 'left' join to cover all 3 situations. – mirekphd May 02 '20 at 11:07
  • Could you kindly explain the meaning of `apply(tuple,1)`? – liangli May 29 '20 at 04:22
  • what if you have duplicates on both sides? e.g. three (3,4) on the left and two (3,4) on the right? – Jayen Mar 07 '21 at 11:27
  • @liangli it converts each row of the selected columns into a tuple – BENY Jun 01 '21 at 14:54
  • does not work if the second df has delta rows. – Emre Bayram Jun 18 '21 at 14:03
  • @Jayen then I would recommend the first solution – BENY Jun 18 '21 at 14:06
  • @BENY: Could you kindly look at [this](https://stackoverflow.com/questions/74304935/comparing-two-dataframes-and-isolating-similar-data-rows-between-them-using-pa) please? – carla Nov 04 '22 at 04:37
  • This answer is only correct for the symmetric difference, not the set / relational difference. – CodingTil Nov 24 '22 at 20:05
73

For rows, try the following, where Name is the common join column (it can be a list for multiple common columns, or specify left_on and right_on):

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)

The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: "left_only", "right_only" or "both".

For columns, try this:

set(df1.columns).symmetric_difference(df2.columns)
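A runnable sketch with made-up frames (Name is assumed to be the common join column here):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'b', 'c'], 'X': [1, 2, 3]})
df2 = pd.DataFrame({'Name': ['a', 'c'], 'Y': [10, 30]})

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
print(m['_merge'].tolist())  # which side each row came from

# Columns present in only one of the frames
print(set(df1.columns).symmetric_difference(df2.columns))
```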
jpp
  • Downvoter care to comment? `merge` with `indicator=True` is the classic solution for comparing dataframes by given fields. – jpp Jan 08 '19 at 09:17
23

The accepted answer's Method 1 will not work for data frames with NaNs inside, as np.nan != np.nan. I am not sure if this is the best way, but it can be avoided by:

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

It's slower, because it needs to cast the data to strings, but thanks to this casting the NaN values compare equal (as the string 'nan').

Let's go through the code. First we cast the values to strings and apply the tuple function to each row.

df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)

Thanks to that, we get a pd.Series object with a tuple for each row. Each tuple contains the whole row from df1/df2. Then we apply the isin method on df1 to check if each tuple "is in" df2. The result is a pd.Series of bool values: True if a tuple from df1 is in df2. In the end, we negate the results with the ~ sign and apply the filter to df1. Long story short, we get only those rows from df1 that are not in df2.

To make it more readable, we may write it as:

df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
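A quick check with made-up NaN-containing frames that the string cast makes NaN rows comparable:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [2.0, np.nan, 4.0]})
df2 = pd.DataFrame({'A': [np.nan], 'B': [np.nan]})

# After astype(str), NaN becomes the string 'nan', which compares equal
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
result = df1[~df1_str_tuples.isin(df2_str_tuples)]
print(result)  # the all-NaN row is correctly excluded
```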
toecsnar42
  • This is a great answer, but it is incomprehensible as a one-liner. If one separates each step and understands what it does, it becomes very clear how it gets the job done. – rioZg Feb 10 '21 at 09:52
  • Added explanation. I hope it helps! – toecsnar42 Feb 17 '21 at 22:19
16
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
    'Age':[23,12,34,44,28,40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#     Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#     Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#     Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt
SpeedCoder5
  • What does '~' mean? – Piotrek Leśniak Aug 12 '20 at 14:29
  • '~' means logical NOT, used for boolean indexing. See: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing – SpeedCoder5 Aug 15 '20 at 23:58
  • @SpeedCoder5: This works for me, except that the resultant DF is comprised of row/s that do not exist in the (compared) DF. However, using this and a **change in the process**, I could achieve my goal. Thanks. My original query [here](https://stackoverflow.com/questions/74304935/comparing-two-dataframes-and-isolating-similar-data-rows-between-them-using-pa). – carla Nov 04 '22 at 07:35
  • Tilde '~' means not, applying here to isin() to be like an isnotin() which doesn't exist yet – gseattle Mar 26 '23 at 15:25
13

Pandas now offers a new API to do data frame diff: pandas.DataFrame.compare

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
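A self-contained sketch matching the output above (requires pandas >= 1.1; `compare` also expects both frames to have identical row and column labels):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col3': [1.0, 2.0, 3.0]})
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'   # change one string cell
df2.loc[2, 'col3'] = 4.0   # change one float cell

# Only the differing cells survive, split into self/other columns
diff = df.compare(df2)
print(diff)
```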
Yushan ZHANG
8

Edit 2: I figured out a new solution without the need to set an index:

newdf=pd.concat([df1,df2]).drop_duplicates(keep=False)

Okay, I found that the highest-voted answer already contains what I figured out. Yes, we can only use this code on the condition that there are no duplicates in either of the two dfs.


I have a tricky method. First we set 'Name' as the index of the two dataframes given by the question. Since we have the same 'Name' values in the two dfs, we can just drop the 'smaller' df's index from the 'bigger' df. Here is the code:

df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
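A runnable sketch of the index-drop trick with made-up names:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith'], 'Age': [23, 45, 12]})
df2 = pd.DataFrame({'Name': ['John', 'Smith'], 'Age': [23, 12]})

# Use the common 'Name' column as the index, then drop df2's labels from df1
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
newdf = df1.drop(df2.index)
print(newdf)  # only Mike remains
```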
VirtualScooter
liangli
8

Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values.

newDf = (df1.set_index('Name1')
            .drop(df2['Name2'], errors='ignore')
            .reset_index(drop=False))
MrE
Cherif Diallo
  • Simple and effective. Added errors='ignore' to resolve the issue for the case where the destination values are not in the source (i.e. the intersection), and resetting the index at the end brings a df that is similar to the original. – MrE Feb 03 '20 at 03:50
7

There is a new method in pandas, DataFrame.compare, that compares two dataframes and returns which values changed in each column for the data records.

Example

First Dataframe

Id Customer Status      Date
1      ABC   Good  Mar 2023
2      BAC   Good  Feb 2024
3      CBA    Bad  Apr 2022

Second Dataframe

Id Customer Status      Date
1      ABC    Bad  Mar 2023
2      BAC   Good  Feb 2024
5      CBA   Good  Apr 2024

Comparing Dataframes

print("Dataframe difference -- \n")
print(df1.compare(df2))

print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))

print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))

print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))

Result

Dataframe difference -- 

    Id       Status            Date          
  self other   self other      self     other
0  NaN   NaN   Good   Bad       NaN       NaN
2  3.0   5.0    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping equal values -- 

    Id       Status            Date          
  self other   self other      self     other
0    1     1   Good   Bad  Mar 2023  Mar 2023
2    3     5    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape -- 

    Id       Customer       Status            Date          
  self other     self other   self other      self     other
0  NaN   NaN      NaN   NaN   Good   Bad       NaN       NaN
1  NaN   NaN      NaN   NaN    NaN   NaN       NaN       NaN
2  3.0   5.0      NaN   NaN    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape and equal values -- 

    Id       Customer       Status            Date          
  self other     self other   self other      self     other
0    1     1      ABC   ABC   Good   Bad  Mar 2023  Mar 2023
1    2     2      BAC   BAC   Good  Good  Feb 2024  Feb 2024
2    3     5      CBA   CBA    Bad  Good  Apr 2022  Apr 2024
Oghli
4

In addition to the accepted answer, I would like to propose one more, wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both dataframes). The method also allows setting the tolerance for float elements in the dataframe comparison (it uses np.isclose).


import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                            df_old: pd.DataFrame, 
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""

    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff

Example:

In [1]

df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
Out [1]

df1:
   A  C
0  2  2
1  1  1
2  2  2 

df2:
   A  B
0  1  1
1  1  1 

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN 
Luchko
3

As mentioned here,

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

is the correct solution, but it will produce wrong output if

df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})

In that case the above solution will give an empty DataFrame. Instead, you should use the concat method after removing duplicates from each dataframe.

Use concat with drop_duplicates:

df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)
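A runnable sketch with made-up frames; note that as written this yields the rows unique to either frame (a symmetric difference):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})

# De-duplicate each frame first, then drop everything that appears twice
df1 = df1.drop_duplicates(keep="first")
df2 = df2.drop_duplicates(keep="first")
result = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(result)
```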
Arun Pal
  • The author of the question asked to return all values in df1 that are not in df2. Therefore, `df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]` is the correct answer even in this case. If you wanted to obtain values that are either in df1 or df2 but not both, then your suggested approach is correct (with the caveat of removing duplicates from the original dataframes). – ira Nov 28 '20 at 14:04
3

I had issues with handling duplicates when there were duplicates on one side and at least one on the other side, so I used collections.Counter to do a better diff, ensuring both sides have the same count. This doesn't return duplicated rows, and it won't return a row at all if both sides have the same count of it.

import pandas as pd
from collections import Counter

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1-c2
    c2c1 = c2-c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
   a
0  1
0  2
Jayen
3

Symmetric Difference

If you are interested in the rows that are only in one of the dataframes but not both, you are looking for the symmetric difference:

pd.concat([df1,df2]).drop_duplicates(keep=False)

⚠️ Only works if neither dataframe contains any duplicates.

Set Difference / Relational Algebra Difference

If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:

pd.concat([df1,df2,df2]).drop_duplicates(keep=False) 

⚠️ Only works if neither dataframe contains any duplicates.
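A sketch of the df1 \ df2 trick with made-up data: concatenating df2 twice guarantees every df2 row appears at least twice, so keep=False removes it even if it never occurred in df1:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 9], 'B': [2, 9]})

# (9, 9) exists only in df2; appearing twice, it is dropped along with (1, 2)
diff = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(diff)
```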

CodingTil
1

A slight variation of @liangli's nice solution that does not require changing the index of the existing dataframes:

newdf = df1.drop(df1.join(df2.set_index('Name').index))
Serge Ballesta
1

Finding the difference by index, assuming df2 is a subset of df1 and the indexes are carried forward when subsetting:

df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

# Example

df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])

df2 =  df1.loc[[1,3,5]]

df1

 gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()

df3

  gender subject
2      m    chem
4      m     bio

DOS
1

Defining our dataframes:

df1 = pd.DataFrame({
    'Name':
        ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age':
        [23,45,12,34,27,44,28,39,40]
})

df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]

df1

    Name  Age
0   John   23
1   Mike   45
2  Smith   12
3   Wale   34
4  Marry   27
5    Tom   44
6  Menda   28
7   Bolt   39
8  Yuswa   40

df2

    Name  Age
0   John   23
2  Smith   12
3   Wale   34
5    Tom   44
6  Menda   28
8  Yuswa   40

The difference between the two would be:

df1[~df1.isin(df2)].dropna()

    Name   Age
1   Mike  45.0
4  Marry  27.0
7   Bolt  39.0

Where:

  • df1.isin(df2) returns the rows in df1 that are also in df2.
  • ~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, i.e. the difference between the two.
  • .dropna() drops the rows with NaN, presenting the desired output.

Note This only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()

Iopheam
1

I found the deepdiff library is a wonderful tool that also extends well to dataframes if different detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:

import pandas as pd
from deepdiff import DeepDiff

df1 = pd.DataFrame({
    'Name':
        ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age':
        [23,45,12,34,27,44,28,39,40]
})

df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]

DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}
ntjess
1

Another possible solution is to use numpy broadcasting:

df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]

Output:

    Name  Age
1   Mike   45
4  Marry   27
7   Bolt   39
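Self-contained, reusing the Name/Age frames from earlier answers (this assumes both frames have identical column order and comparable dtypes):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Marry', 'Bolt'],
                    'Age': [23, 45, 27, 39]})
df2 = pd.DataFrame({'Name': ['John'], 'Age': [23]})

# Broadcast-compare every df1 row against every df2 row;
# keep the df1 rows that match no row of df2
mask = np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)
result = df1[mask]
print(result)
```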
PaulS
0

Using a lambda function, you can filter the rows with _merge value "left_only" to get all the rows in df1 that are missing from df2:

df3 = df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
df3
SherylHohman
curiousBrain
  • Smart, you can also use ```.query("_merge == 'left_only'")``` instead of the lambda in your loc: ```df1.merge(df2, how = 'outer' ,indicator=True).query("_merge == 'left_only'")``` – dimButTries Mar 06 '22 at 21:49
0

Try this one: df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')

It will result in a new dataframe with the differences: the values that exist in df1 but not in df2.
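A runnable sketch of this one-liner with made-up data (written with drop(columns='_merge'), since the positional axis argument to drop was removed in pandas 2.0):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# Outer merge on all common columns, keep df1-only rows, drop the indicator
df_new = (df1.merge(df2, how='outer', indicator=True)
             .query('_merge == "left_only"')
             .drop(columns='_merge'))
print(df_new)
```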