
I have two pandas dataframes. Let's say the first one is master:

ID  COL1    COL2
1   A       AA
2   B       BB
3   C       CC
4   D       DD

And the other one is source:

ID  COL1    COL2
1   A       ZZ
2   B       BB
3   YY      CC
5   G       GG
6   H       HH

Evidently, the lengths can differ and the differences can be in more than one column; however, the structure will be the same. I want to find the records in source that are either new or different from what is available in master. That is, the output I am looking for is this dataframe:

ID  COL1    COL2
1   A       ZZ
3   YY      CC
5   G       GG
6   H       HH
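
For reproducibility, the two frames above can be built like this (a minimal sketch of the sample data):

```python
import pandas as pd

master = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'COL1': ['A', 'B', 'C', 'D'],
    'COL2': ['AA', 'BB', 'CC', 'DD'],
})

source = pd.DataFrame({
    'ID':   [1, 2, 3, 5, 6],
    'COL1': ['A', 'B', 'YY', 'G', 'H'],
    'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH'],
})
```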

I tried the solutions in:

But none of those seem to be working for me. This is basically trying to find out what's new.

pallupz

3 Answers


Use merge with indicator=True and an outer join, then filter for the right_only rows and select only the columns of df2 (here df1 is master and df2 is source):

# specified columns in a list
cols = ['COL1','COL2']
# or all columns except ID:
#cols = df1.columns.difference(['ID'])
df = (df1.merge(df2, on=cols, how='outer', indicator=True, suffixes=('_',''))
         .query("_merge == 'right_only'")[df2.columns])
print(df)
    ID COL1 COL2
4  1.0    A   ZZ
5  3.0   YY   CC
6  5.0    G   GG
7  6.0    H   HH
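
Assuming df1 is master and df2 is source from the question, the whole thing runs end to end like this (a self-contained sketch):

```python
import pandas as pd

# the question's data; df1 = master, df2 = source
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'COL1': ['A', 'B', 'C', 'D'],
                    'COL2': ['AA', 'BB', 'CC', 'DD']})
df2 = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                    'COL1': ['A', 'B', 'YY', 'G', 'H'],
                    'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']})

cols = ['COL1', 'COL2']
# the outer merge keeps every row from both sides; _merge marks its origin,
# so right_only means "present in df2 but not matched in df1"
df = (df1.merge(df2, on=cols, how='outer', indicator=True, suffixes=('_', ''))
         .query("_merge == 'right_only'")[df2.columns])
print(df)
```

Note the ID column comes back as float because the left-only rows contribute NaN to it before filtering.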
jezrael

You can create a mask and use boolean indexing:

# set 'ID' as the index so the rows align by ID
source = source.set_index('ID')
master = master.set_index('ID')

# flag rows where any cell in source differs from the aligned cell in master
mask = (~source.isin(master)).any(axis=1)
# boolean indexing
source[mask]

   COL1 COL2
ID          
1     A   ZZ
3    YY   CC
5     G   GG
6     H   HH
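
End to end with the question's frames, the sketch looks like this:

```python
import pandas as pd

master = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'COL1': ['A', 'B', 'C', 'D'],
                       'COL2': ['AA', 'BB', 'CC', 'DD']}).set_index('ID')
source = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                       'COL1': ['A', 'B', 'YY', 'G', 'H'],
                       'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']}).set_index('ID')

# True where a cell in source has no equal counterpart in master
# (IDs missing from master compare False everywhere, so they are kept)
mask = (~source.isin(master)).any(axis=1)
result = source[mask]
print(result)
```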
It_is_Chris
  • Solution is wrong generally, only working with the sample data. Try changing the order of some rows to check it. – jezrael Feb 06 '19 at 08:20

There are several ways to approach this, depending on how you wish to treat memory allocation and whether you intend to work with large datasets or it's just for academic/training purposes.

  1. Iterate through the comparisons and append the non-matching rows to a new dataframe. (More code, but more memory-efficient.)
  2. Create a new merged (outer) dataframe and apply a function to remove duplicates. (Less code, but less memory-efficient.)

These are just two ideas, but there might be more, it's only meant to give an insight.

Solution 1: (considering IDs are unique and not an index)

ids = source['ID'].tolist()  # get a list of all the IDs in source
results = pd.DataFrame(columns=source.columns)  # empty df with the same columns
s = source.set_index('ID')  # index by ID so rows can be compared directly
m = master.set_index('ID')
for i in ids:
    # keep the row when the ID is new, or when any column differs from master
    # (negating the all-equal case, i.e. the NOT that ~ expresses)
    if i not in m.index or not s.loc[i].equals(m.loc[i]):
        results = pd.concat([results, source[source['ID'] == i]])

Solution 2:

# stack the frames (assuming both dfs have the same columns); rows present
# in both now appear twice, so keep=False drops every copy of them - but
# note that master-only rows survive too
results = pd.concat([source, master])
final_results = results.drop_duplicates(keep=False)
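
Run against the question's data, Solution 2 needs the frames stacked with pd.concat rather than merged, since drop_duplicates(keep=False) only bites when shared rows appear twice; tagging each frame with keys also lets us drop the master-only leftovers afterwards (variable names here are illustrative):

```python
import pandas as pd

master = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'COL1': ['A', 'B', 'C', 'D'],
                       'COL2': ['AA', 'BB', 'CC', 'DD']})
source = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                       'COL1': ['A', 'B', 'YY', 'G', 'H'],
                       'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']})

# stack both frames, remembering where each row came from
stacked = pd.concat([source, master], keys=['source', 'master'])
# keep=False drops every copy of a row that appears in both frames
changed = stacked.drop_duplicates(keep=False)
# keep only the rows that originated in source (drops master-only leftovers)
final_results = changed.loc['source'].reset_index(drop=True)
print(final_results)
```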
Matías Romo