Comparing two dataframes and getting the differences

Question

I have two dataframes. Example:

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange

Each dataframe has the Date as an index. Both dataframes have the same structure.

What I want to do is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, Apple, etc.) to see whether each combination in df2 also exists in df1.

I have tried the following:

For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.

On the third approach, I get the assert to return False but cannot figure out how to actually see the different rows.

Any pointers would be welcome

If you do this: http://www.cookbook-r.com/Manipulating_data/Renaming_columns_in_a_data_frame/, will it get rid of the 'identically-labeled DataFrame objects' exception? — Anthony Kong, Nov 26 '13 at 18:35
I've changed column names many times to try to get around the issue with no luck. — Eric D. Brown D.Sc., Nov 26 '13 at 18:46
FWIW, I changed column names to be "a,b,c,d" on both dataframes and receive the same error message. — Eric D. Brown D.Sc., Nov 26 '13 at 19:09

score 136 · Accepted Answer · answered Nov 26 '13 at 21:14

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all of the dataframes' axes are compared with the _indexed_same method, and an exception is raised if any difference is found, even in the order of the columns or indices.
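To make the labelling requirement concrete, here is a minimal sketch. The two small frames are made up for illustration, and it assumes a recent pandas release, where the failure surfaces as a ValueError (the 2013 version in the question raised a plain Exception).

```python
import pandas as pd

# Hypothetical frames: df2 has one more row than df1, so the row labels differ.
df1 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Orange", "Apple"], "Num": [22.1, 8.6, 7.6]})

# The != operator refuses to compare frames whose labels are not identical.
try:
    df1 != df2
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# With matching labels the same operator returns an element-wise boolean mask.
same_shape_mask = df1 != df2.iloc[:2]
print(same_shape_mask)
```

So the operator answers "which cells changed", not "which rows are missing", which is why the rest of this answer takes a different route.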

If I understand you correctly, you want to find not the changes but the symmetric difference. For that, one approach is to concatenate the dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
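The steps above can be collected into one self-contained sketch. The helper name rows_seen_once is hypothetical, the frames are rebuilt from the question's data with Date as an ordinary column, and .loc is used in place of reindex. One caveat worth assuming: groupby drops rows with NaN in the grouping columns by default, so this sketch only suits data without missing values.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
extra = pd.DataFrame({
    "Date": ["2013-11-25"] * 2,
    "Fruit": ["Apple", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Red", "Orange"],
})
df2 = pd.concat([df1, extra], ignore_index=True)


def rows_seen_once(*frames):
    """Return the rows that occur exactly once across all frames (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True)
    # Group on every column, so each group is one distinct row.
    groups = combined.groupby(list(combined.columns)).groups
    # Keep the label of every row whose group has a single member.
    labels = sorted(idx[0] for idx in groups.values() if len(idx) == 1)
    return combined.loc[labels]


diff = rows_seen_once(df1, df2)
print(diff)
```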

alko

This was the answer. I removed the "Date" index and followed this approach and I get right output. – Eric D. Brown D.Sc. Nov 26 '13 at 21:43
Is there an easy way to add a flag to this to see which rows were removed/added/changed from df1 to df2? – pyCthon Nov 23 '15 at 20:07
@alko I was wondering, does this `pd.concat` add in only the missing items from the `df1`? Or does it replace `df1` completely with `df2`? – jake wong Feb 20 '16 at 17:26
@jakewong `pd.concat` - as used here - does an outer join. In other words, it joins all indices from both df's and this is in fact the default behaviour for `pd.concat()`, here's the docs http://pandas.pydata.org/pandas-docs/stable/merging.html – Thanos Apr 17 '16 at 18:35
what is the maximum number of records we can compare using pandas ? – Pyd Jan 31 '18 at 09:29
Using this approach, how do we find out which row is missing from which dataframe. Is there a way to get that info ? – Naxi Apr 19 '21 at 13:33
For people looking at this answer, just go to: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html – asemprini87 Jul 29 '21 at 17:25
Can you not simply concat the two data-frames e.g. df = pd.concat([df1, df2], ignore_index = True) And then do df.drop_duplicates(['Date', 'Fruit']) ? That would give you the rows that are not in common between df1 and df2, assuming that Date and Fruit are the correct keys for your comparison. – Carl Aug 04 '21 at 10:31
df.drop_duplicates(['Date', 'Fruit'], keep = False) – Carl Aug 04 '21 at 11:08

score 96 · Answer 2 · edited Jan 05 '21 at 02:26

Updating ling's comment on jur's answer and placing it where it will be easier for others to find:

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

Testing with these DataFrames:

# with import pandas as pd

df1 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
    })

df2 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
    })

Results in this:

# for df1

         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green


# for df2

         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange


# for df_diff

         Date   Fruit   Num   Color
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange

But this answer would not show the rows if the duplicates are in the same DataFrame. For example, if `df1` contains two identical rows but `df2` doesn't contain any of these. — Bohdan Pylypenko, May 12 '22 at 08:46
@BohdanPylypenko - True! But I am taking it as given that folks get their data within each set unique before they ever get to a step of comparing across separate datasets. (If they don't they are setting themselves up for a confusing jumble of issues in source and across sources to sort out all at once.) — leerssej, Jun 16 '22 at 06:17
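The caveat raised in the comments can be reproduced with a small sketch. The Kiwi data is invented for illustration; de-duplicating each frame first is one possible workaround, not something the answer itself prescribes.

```python
import pandas as pd

# Hypothetical data: df1 holds the same Kiwi row twice, df2 has no Kiwi at all.
df1 = pd.DataFrame({"Fruit": ["Kiwi", "Kiwi", "Banana"], "Num": [1.0, 1.0, 22.1]})
df2 = pd.DataFrame({"Fruit": ["Banana"], "Num": [22.1]})

# keep=False drops every copy of a repeated row, so the Kiwi rows vanish
# even though they exist only in df1.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# De-duplicating each frame on its own first keeps one Kiwi row in the result.
deduped = pd.concat(
    [df1.drop_duplicates(), df2.drop_duplicates()]
).drop_duplicates(keep=False)

print(naive)
print(deduped)
```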

jur · Answer 3 · 2019-01-25T12:16:26.570

26

Passing the dataframes to concat in a dictionary results in a multi-index dataframe from which you can easily drop the duplicates. What remains is a multi-index dataframe containing the differences between the dataframes:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange""")


# raw strings avoid the invalid-escape warning for '\s'
df1 = pd.read_table(DF1, sep=r'\s+')
df2 = pd.read_table(DF2, sep=r'\s+')
#%%
dfs_dictionary = {'DF1': df1, 'DF2': df2}
df = pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)

Result:

             Date   Fruit   Num   Color
DF2 4  2013-11-25   Apple  22.1     Red
    5  2013-11-25  Orange   8.6  Orange

edited Jan 25 '19 at 12:16

answered Mar 07 '17 at 15:26


This is a much easier method; one more revision may make it even easier. No need to concat in a dictionary, df = pd.concat([df1,df2]) would do the same – ling Mar 20 '17 at 11:29
you should not overwrite built-in keyword `dict`! – denfromufa Jul 23 '17 at 01:18
Is there a way to add to this to determine which data frame contained the unique row? – jlewkovich Jan 23 '19 at 21:27
You can tell by the first level in the multiindex which contains the key of the dataframe in the dictionary (I updated the output with the correct keys) – jur Jan 25 '19 at 12:15
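As a sketch of reading the source frame off the first index level: the level names "source" and "row" are my own choice, passed through the names argument of pd.concat, and the frames are rebuilt from the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.concat([
    df1,
    pd.DataFrame({
        "Date": ["2013-11-25"] * 2,
        "Fruit": ["Apple", "Orange"],
        "Num": [22.1, 8.6],
        "Color": ["Red", "Orange"],
    }),
], ignore_index=True)

# The dictionary keys become the outer index level, named "source" here.
diff = pd.concat({"DF1": df1, "DF2": df2}, names=["source", "row"]).drop_duplicates(keep=False)

# Read the originating frame straight off the first level ...
source = diff.index.get_level_values("source")

# ... or move it into an ordinary column for filtering and reporting.
labelled = diff.reset_index(level="source")
print(labelled)
```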

score 26 · Answer 4 · answered Oct 31 '19 at 17:09

# This works for me

# Get all rows that are not present in both dataframes
df3 = pd.merge(df1, df2, how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']


# If you would like to match on a common ID column instead
df3 = pd.merge(df1, df2, on="Fruit", how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
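Since the question asks for a comparison on Date and Fruit specifically, here is a hedged variant of the same idea that merges on those two keys and keeps only the rows found in df2. The suffixes and the default _merge indicator column are my choices, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.concat([
    df1,
    pd.DataFrame({
        "Date": ["2013-11-25"] * 2,
        "Fruit": ["Apple", "Orange"],
        "Num": [22.1, 8.6],
        "Color": ["Red", "Orange"],
    }),
], ignore_index=True)

# Outer merge on the two key columns; _merge records where each key pair was found.
merged = pd.merge(
    df1, df2,
    on=["Date", "Fruit"],
    how="outer",
    indicator=True,
    suffixes=("_df1", "_df2"),
)

# "right_only" marks key pairs that exist in df2 but not in df1.
only_in_df2 = merged.loc[
    merged["_merge"] == "right_only",
    ["Date", "Fruit", "Num_df2", "Color_df2"],
]
print(only_in_df2)
```

Filtering on "left_only" instead would give the rows that exist only in df1.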

Ivan Moran

this is the best answer – moshevi Sep 13 '20 at 15:07
This works really well for multi-column dataframes. – Amadeus Stevenson May 30 '23 at 17:56

score 20 · Answer 5 · answered Jul 31 '20 at 09:40

Since pandas >= 1.1.0 we have DataFrame.compare and Series.compare.

Note: the method can only compare identically-labeled DataFrame objects, this means DataFrames with identical row and column labels.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, np.nan, 9]})

df2 = pd.DataFrame({'A': [1, 99, 3],
                    'B': [4, 5, 81],
                    'C': [7, 8, 9]})

   A  B    C
0  1  4  7.0
1  2  5  NaN
2  3  6  9.0 

    A   B  C
0   1   4  7
1  99   5  8
2   3  81  9

df1.compare(df2)

     A          B          C      
  self other self other self other
1  2.0  99.0  NaN   NaN  NaN   8.0
2  NaN   NaN  6.0  81.0  NaN   NaN

Erfan

Thank you for this information. I haven't moved to 1.1 yet, but this is good to know. – Eric D. Brown D.Sc. Jul 31 '20 at 14:18
compare only works if the 2 dataFrames are at the same size. right? – Rebin Aug 26 '21 at 22:35
Yes, see the note in my answer @Rebin – Erfan Nov 23 '21 at 11:20
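To illustrate the size restriction discussed in these comments, here is a small sketch with invented data. It assumes pandas 1.1 or later, and restricting both frames to their shared row labels is only one possible way to make them comparable.

```python
import pandas as pd

# Hypothetical frames of different length; the Orange quantity also differs.
df1 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Orange", "Apple"], "Num": [22.1, 9.0, 7.6]})

# compare() rejects frames whose labels differ.
try:
    df1.compare(df2)
    raised = False
except ValueError:
    raised = True

# Restricting both frames to the shared row labels makes them comparable,
# at the cost of ignoring rows that exist on one side only.
shared = df1.index.intersection(df2.index)
changes = df1.loc[shared].compare(df2.loc[shared])
print(changes)
```

Rows that exist on one side only still need one of the concat or merge approaches from the other answers.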

score 6 · Answer 6 · answered Feb 23 '16 at 10:03

Building on alko's answer that almost worked for me, except for the filtering step (where I get: ValueError: cannot reindex from a duplicate axis), here is the final solution I used:

# join the dataframes
united_data = pd.concat([data1, data2, data3, ...])
# group the data by the whole row to find duplicates
united_data_grouped = united_data.groupby(list(united_data.columns))
# detect the row indices of unique rows
uniq_data_idx = [x[0] for x in united_data_grouped.indices.values() if len(x) == 1]
# extract those unique values
uniq_data = united_data.iloc[uniq_data_idx]

fnl

Nice addition to the answer. Thanks – Eric D. Brown D.Sc. Feb 23 '16 at 13:19
I'm getting the error `IndexError: index out of bounds` when I try to run the third line. – Moondra Mar 23 '17 at 21:07

score 5 · Answer 7 · edited Apr 30 '21 at 20:29

Get the rows of df2 whose Fruit also exists in df1:

dfe = df2[df2["Fruit"].isin(df1["Fruit"])]

Get the rows of df2 whose Fruit does not exist in df1:

dfn = df2[~ df2["Fruit"].isin(df1["Fruit"])]

You can combine more than one such comparison, for example with & or |.

Tomerikoo

answered Apr 30 '21 at 17:58

Alex Alcalá

Works great! Thank you – dimButTries Oct 05 '21 at 19:23

score 4 · Answer 8 · answered Aug 27 '18 at 22:12

Found a simple solution here:

https://stackoverflow.com/a/47132808/9656339

pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
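The one-liner compares index labels only, so it is meaningful when the index actually identifies a row. A sketch of a key-based variant follows, assuming Date and Fruit together identify a row; it selects with Index.isin rather than .loc, which is my substitution.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
})
df2 = pd.concat([
    df1,
    pd.DataFrame({
        "Date": ["2013-11-25"] * 2,
        "Fruit": ["Apple", "Orange"],
        "Num": [22.1, 8.6],
    }),
], ignore_index=True)

# Use the comparison keys as the index so the index labels carry meaning.
a = df1.set_index(["Date", "Fruit"])
b = df2.set_index(["Date", "Fruit"])

# Key pairs that appear in exactly one of the two indexes.
only_one_side = a.index.symmetric_difference(b.index)

# Select those key pairs from the stacked frames with a boolean mask.
combined = pd.concat([a, b])
diff = combined[combined.index.isin(only_one_side)]
print(diff)
```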

Tom2shoes

Welcome to Stack Overflow Tom2shoes. Please don't provide link-only answers, try to extract the content from the link and leave it only as a reference (as the content in the link can be deleted or the link itself can break). For more information refer to ["How do I write a good answer?"](https://stackoverflow.com/help/how-to-answer). If you believe this question was already answered in another question, please mark it as a duplicate. – GGG Aug 27 '18 at 22:34

score 3 · Answer 9 · answered Aug 25 '17 at 10:16

There is a simpler, faster solution, and if the numbers differ it can even give you the differences in quantity:

df1_i = df1.set_index(['Date','Fruit','Color'])
df2_i = df2.set_index(['Date','Fruit','Color'])
df_diff = df1_i.join(df2_i,how='outer',rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])

Here df_diff is a synopsis of the differences. You can even use it to find the differences in quantities. In your example:

Explanation: As with comparing two lists, to do this efficiently we should first order them and then compare them (converting the lists to sets, i.e. hashing, would also be fast). Both are a huge improvement over the simple O(N^2) double comparison loop.

Note: the following code produces the tables:

df1=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
})
df2=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
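Putting the pieces of this answer together, a self-contained sketch might look like the following. The nonzero filter at the end is my addition, and note the assumption baked into fillna(0): a row missing from one side is treated as a quantity of zero, which may not suit every dataset.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

keys = ["Date", "Fruit", "Color"]

# Outer join on the key columns; df2's quantity lands in the "Num_" column.
joined = df1.set_index(keys).join(df2.set_index(keys), how="outer", rsuffix="_").fillna(0)

# Quantity in df1 minus quantity in df2, per key.
df_diff = joined["Num"] - joined["Num_"]

# Keep only the keys whose quantities differ (my addition).
nonzero = df_diff[df_diff != 0]
print(nonzero)
```

A negative value here means the quantity is larger in df2, including rows that exist only in df2.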

SpeedCoder5 · Answer 10 · 2018-06-07T18:07:27.677

# given
df1=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green']})
df2=pd.DataFrame({'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,1000,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange']})

# find which rows are in df2 that aren't in df1 by Date and Fruit
df_2notin1 = df2[~(df2['Date'].isin(df1['Date']) & df2['Fruit'].isin(df1['Fruit']) )].dropna().reset_index(drop=True)

# output
print('df_2notin1\n', df_2notin1)
#      Color        Date   Fruit   Num
# 0     Red  2013-11-25   Apple  22.1
# 1  Orange  2013-11-25  Orange   8.6
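One thing to watch with this pattern: the two isin tests are evaluated per column, so a (Date, Fruit) pair can be reported as present when its date and its fruit each occur in df1 but on different rows. The sketch below uses invented data to show that, together with a pair-wise membership test via MultiIndex.from_frame, which is my suggestion rather than part of the answer.

```python
import pandas as pd

# Hypothetical data: the pair (2013-11-25, Apple) exists only in df2,
# but both of its values occur somewhere in df1.
df1 = pd.DataFrame({"Date": ["2013-11-24", "2013-11-25"],
                    "Fruit": ["Apple", "Orange"]})
df2 = pd.DataFrame({"Date": ["2013-11-24", "2013-11-25", "2013-11-25"],
                    "Fruit": ["Apple", "Orange", "Apple"]})

# Column-wise test: each column is checked independently, so the new pair is missed.
columnwise = df2[~(df2["Date"].isin(df1["Date"]) & df2["Fruit"].isin(df1["Fruit"]))]

# Pair-wise test: build one key per row and test membership of the whole pair.
keys = ["Date", "Fruit"]
pair_mask = pd.MultiIndex.from_frame(df2[keys]).isin(pd.MultiIndex.from_frame(df1[keys]))
pairwise = df2[~pair_mask]

print(columnwise)
print(pairwise)
```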

eyquem · Answer 11 · 2013-11-26T20:09:21.697

I got this solution. Does it help you?

text = """df1:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2:
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange



argetz45
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 118.6 Orange
2013-11-24 Apple 74.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25     Nuts    45.8 Brown
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
2013-11-26   Pear 102.54    Pale"""

from collections import OrderedDict
import re

# raw strings keep the regex escapes (\d, \Z) from being read as string escapes
r = re.compile(r'([a-zA-Z\d]+).*\n'
               r'(20\d\d-[01]\d-[0123]\d.+\n?'
               r'(.+\n?)*)'
               r'(?=[ \n]*\Z'
                  r'|'
                  r'\n+[a-zA-Z\d]+.*\n'
                  r'20\d\d-[01]\d-[0123]\d)')

r2 = re.compile(r'((20\d\d-[01]\d-[0123]\d) +([^\d.]+)(?<! )[^\n]+)')

d = OrderedDict()
bef = []

for m in r.finditer(text):
    li = []
    for x in r2.findall(m.group(2)):
        if not any(x[1:3]==elbef for elbef in bef):
            bef.append(x[1:3])
            li.append(x[0])
    d[m.group(1)] = li


# Python 3: items() replaces iteritems(), and print is a function
for name, lu in d.items():
    print('%s\n%s\n' % (name, '\n'.join(lu)))

result

df1
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green

df2
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange

argetz45
2013-11-25     Nuts    45.8 Brown
2013-11-26   Pear 102.54    Pale

Thanks for the help. I saw the answer by @alko and that code worked well. — Eric D. Brown D.Sc., Nov 27 '13 at 00:48

score 1 · Answer 12 · answered Jun 21 '19 at 09:20

I tried this method, and it worked. I hope it can help too:

"""Identify differences between two pandas DataFrames"""
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
df_all = pd.concat([df1, df2], axis='columns', keys=['First', 'Second'])
df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]
df_final[df_final['change this to one of the columns'] != df_final['change this to one of the columns']]

score 1 · Answer 13 · answered Aug 16 '21 at 15:21

Use an outer merge with indicator=True and keep the rows marked left_only, i.e. the rows that exist only in df2:

txt1="""Date,Fruit,Num,Color 
2013-11-24,Banana,22.1,Yellow
2013-11-24,Orange,8.6,Orange
2013-11-24,Apple,7.6,Green
2013-11-24,Celery,10.2,Green"""

txt2="""Date,Fruit,Num,Color 
2013-11-24,Banana,22.1,Yellow
2013-11-24,Orange,8.6,Orange
2013-11-24,Apple,7.6,Green
2013-11-24,Celery,10.2,Green
2013-11-25,Apple,22.1,Red
2013-11-25,Orange,8.6,Orange"""

from io import StringIO
import pandas as pd
f = StringIO(txt1)
df1 = pd.read_table(f,sep =',')
df1.set_index('Date',inplace=True)

f = StringIO(txt2)
df2 = pd.read_table(f,sep =',')
df2.set_index('Date',inplace=True)

df3 =pd.merge(df2, df1, left_index=True, right_index=True,  how='outer', 
     indicator=True
         ,suffixes=("", "_left")
         ).query("_merge=='left_only'")
remove_columns=[item for item in df3.columns if '_left' in item]
remove_columns.append('_merge')
df3=df3.drop(columns=remove_columns)
print(df3)

output:

         Date   Fruit   Num  Color 
0  2013-11-25   Apple  22.1     Red
1  2013-11-25  Orange   8.6  Orange

score 0 · Answer 14 · answered Mar 03 '18 at 23:20

One important detail is that your data has duplicate index values, so to perform any straightforward comparison we first need to make the index unique with df.reset_index(); after that we can select rows based on conditions. Since the index is already defined in your case, I assume you would like to keep it, so here is a one-line solution:

df2.reset_index()[~df2.reset_index().isin(df1.reset_index())].dropna().set_index('Date')

Since the goal from a Pythonic perspective is readability, we can break this up a little:

# keep the index name, if it does not have a name it uses the default name
index_name = df2.index.name if df2.index.name else 'index'

# setting the index to become unique
df1 = df1.reset_index()
df2 = df2.reset_index()

# getting the differences to a Dataframe
df_diff = df2[~df2.isin(df1)].dropna().set_index(index_name)

score 0 · Answer 15 · answered Feb 07 '19 at 06:42

Hope this would be useful to you. ^o^

df1 = pd.DataFrame({'date': ['0207', '0207'], 'col1': [1, 2]})
df2 = pd.DataFrame({'date': ['0207', '0207', '0208', '0208'], 'col1': [1, 2, 3, 4]})
print(f"df1(Before):\n{df1}\ndf2:\n{df2}")
"""
df1(Before):
   date  col1
0  0207     1
1  0207     2

df2:
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

old_set = set(df1.index.values)
new_set = set(df2.index.values)
new_data_index = new_set - old_set
new_data_list = []
for idx in new_data_index:
    new_data_list.append(df2.loc[idx])

if len(new_data_list) > 0:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df1 = pd.concat([df1, pd.DataFrame(new_data_list)])
print(f"df1(After):\n{df1}")
"""
df1(After):
   date  col1
0  0207     1
1  0207     2
2  0208     3
3  0208     4
"""

score 0 · Answer 16 · answered Jun 08 '21 at 08:44

You can find the difference between DataFrame row counts:

df2.value_counts().sub(df1.value_counts(), fill_value=0)

Output:

Date        Fruit   Num     Color
2013-11-24  Apple   7.6     Green     0.0
            Banana  22.1    Yellow    0.0
            Celery  10.2    Green    -1.0
                    1000.0  Green     1.0
            Orange  8.6     Orange    0.0
2013-11-25  Apple   22.1    Red       1.0
            Orange  8.6     Orange    1.0
dtype: float64
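For a runnable version of this idea, the sketch below rebuilds frames that match the output shown (Celery is 10.2 in df1 and 1000 in df2, as in an earlier answer's test data) and then keeps only the nonzero entries. It assumes pandas 1.1 or later, where DataFrame.value_counts is available; the final filter is my addition.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 1000, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
})

# Count each distinct row per frame, then subtract the counts.
# Positive: the row occurs more often in df2; negative: more often in df1.
delta = df2.value_counts().sub(df1.value_counts(), fill_value=0)

# Keep only the rows whose counts differ (my addition).
changed = delta[delta != 0]
print(changed)
```

Unlike drop_duplicates(keep=False), this keeps track of how many times a row occurs, so duplicates inside one frame do not cancel out silently.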
