Find the difference between data frames based on specific columns and output the entire record

Question

I want to compare 2 csv (A and B) and find out the rows which are present in B but not in A in based only on specific columns.

I found few answers to that but it is still not giving result what I expect. Answer 1 :

df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]

This doesn't work. It works for single column but not for multiple.

Answer 2 :

df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)

This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.

Found the answer here https://stackoverflow.com/a/45198954/1298426 on how to use multiple columns. — user1298426, Aug 23 '20 at 19:42

Dev Randalpura · Answer 1 · 2020-08-23T19:32:16.593

I tried this and it gave me the rows:

import pandas as pd

columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)

col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()

new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]

This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:

for column in new.columns:
    
    if '_x' in column:
        new = new.drop(column, axis = 1)
        
    elif '_y' in column:
        new = new.rename(columns = {column: column[:column.find('_y')]})

Tell me if it works.

AttributeError: 'DataFrame' object has no attribute 'rank_x' — user1298426, Aug 23 '20 at 19:17
Oh yeah I forgot, it isn't 'rank_x'. Just use any column from DataFrame A that isn't in the list `columns`. — Dev Randalpura, Aug 23 '20 at 19:22

Find the difference between data frames based on specific columns and output the entire record

1 Answers1