
I have two pandas dataframes. Let's say the first one is master:

ID  COL1    COL2
1   A       AA
2   B       BB
3   C       CC
4   D       DD

And the other one is source:

ID  COL1    COL2
1   A       ZZ
2   B       BB
3   YY      CC
5   G       GG
6   H       HH

Evidently, the lengths can differ and the differences can be in more than one column; however, the structure will be the same. I want to find the records in source that are either new or different from what is available in master. That is, the output I am looking for is this dataframe:

ID  COL1    COL2
1   A       ZZ
3   YY      CC
5   G       GG
6   H       HH
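
For reproducibility, the two frames above can be built like this (a minimal sketch of the sample data):

```python
import pandas as pd

master = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'COL1': ['A', 'B', 'C', 'D'],
    'COL2': ['AA', 'BB', 'CC', 'DD'],
})

source = pd.DataFrame({
    'ID':   [1, 2, 3, 5, 6],
    'COL1': ['A', 'B', 'YY', 'G', 'H'],
    'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH'],
})
```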

I tried the solutions in:

But none of those seem to be working for me. This is basically trying to find out what's new.

pallupz

3 Answers


Use merge with indicator=True and an outer join, then filter for the right_only rows and select only the columns of df2 (here df1 is master and df2 is source):

# specified columns in a list
cols = ['COL1','COL2']
# or all columns except ID:
#cols = df1.columns.difference(['ID'])
df = (df1.merge(df2, on=cols, how='outer', indicator=True, suffixes=('_',''))
         .query("_merge == 'right_only'")[df2.columns])
print(df)
    ID COL1 COL2
4  1.0    A   ZZ
5  3.0   YY   CC
6  5.0    G   GG
7  6.0    H   HH
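
Assuming df1 is master and df2 is source from the question, the whole thing runs end to end like this (a self-contained sketch):

```python
import pandas as pd

# the question's data; df1 = master, df2 = source
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'COL1': ['A', 'B', 'C', 'D'],
                    'COL2': ['AA', 'BB', 'CC', 'DD']})
df2 = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                    'COL1': ['A', 'B', 'YY', 'G', 'H'],
                    'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']})

cols = ['COL1', 'COL2']
# the outer merge keeps every row from both sides; _merge marks its origin,
# so right_only means "present in df2 but not matched in df1"
df = (df1.merge(df2, on=cols, how='outer', indicator=True, suffixes=('_', ''))
         .query("_merge == 'right_only'")[df2.columns])
print(df)
```

Note the ID column comes back as float because the left-only rows contribute NaN to it before filtering.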
jezrael

You can create a mask and use boolean indexing:

# set 'ID' as the index so the rows align by ID
source = source.set_index('ID')
master = master.set_index('ID')

# flag rows where any cell in source differs from the aligned cell in master
mask = (~source.isin(master)).any(axis=1)
# boolean indexing
source[mask]

   COL1 COL2
ID          
1     A   ZZ
3    YY   CC
5     G   GG
6     H   HH
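
End to end with the question's frames, the sketch looks like this:

```python
import pandas as pd

master = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'COL1': ['A', 'B', 'C', 'D'],
                       'COL2': ['AA', 'BB', 'CC', 'DD']}).set_index('ID')
source = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                       'COL1': ['A', 'B', 'YY', 'G', 'H'],
                       'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']}).set_index('ID')

# True where a cell in source has no equal counterpart in master
# (IDs missing from master compare False everywhere, so they are kept)
mask = (~source.isin(master)).any(axis=1)
result = source[mask]
print(result)
```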
It_is_Chris
  • Solution is wrong generally, only working with the sample data. Try changing the order of some rows to check it. – jezrael Feb 06 '19 at 08:20

There are several ways to approach this, depending on how you wish to treat memory allocation and whether you intend to work with large datasets or it's just for academic/training purposes.

  1. Iterate through the comparisons and append the non-matching rows to a new dataframe. (More code, but more memory-efficient.)
  2. Create a new merged (outer) dataframe and apply a function to remove duplicates. (Less code, but less memory-efficient.)

These are just two ideas, but there might be more, it's only meant to give an insight.

Solution 1: (considering IDs are unique and not an index)

ids = source['ID'].tolist()  # get a list of all the IDs in source
results = pd.DataFrame(columns=source.columns)  # empty df with the same columns
s = source.set_index('ID')  # index by ID so rows can be compared directly
m = master.set_index('ID')
for i in ids:
    # keep the row when the ID is new, or when any column differs from master
    # (negating the all-equal case, i.e. the NOT that ~ expresses)
    if i not in m.index or not s.loc[i].equals(m.loc[i]):
        results = pd.concat([results, source[source['ID'] == i]])

Solution 2:

# stack the frames (assuming both dfs have the same columns); rows present
# in both now appear twice, so keep=False drops every copy of them - but
# note that master-only rows survive too
results = pd.concat([source, master])
final_results = results.drop_duplicates(keep=False)
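
Run against the question's data, Solution 2 needs the frames stacked with pd.concat rather than merged, since drop_duplicates(keep=False) only bites when shared rows appear twice; tagging each frame with keys also lets us drop the master-only leftovers afterwards (variable names here are illustrative):

```python
import pandas as pd

master = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'COL1': ['A', 'B', 'C', 'D'],
                       'COL2': ['AA', 'BB', 'CC', 'DD']})
source = pd.DataFrame({'ID': [1, 2, 3, 5, 6],
                       'COL1': ['A', 'B', 'YY', 'G', 'H'],
                       'COL2': ['ZZ', 'BB', 'CC', 'GG', 'HH']})

# stack both frames, remembering where each row came from
stacked = pd.concat([source, master], keys=['source', 'master'])
# keep=False drops every copy of a row that appears in both frames
changed = stacked.drop_duplicates(keep=False)
# keep only the rows that originated in source (drops master-only leftovers)
final_results = changed.loc['source'].reset_index(drop=True)
print(final_results)
```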
Matías Romo