
I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable, but not all. The rows represent individuals and the columns contain data about those individuals. I want to check certain column values of one dataframe, row by row, against the 'master' dataframe and then return the omissions so they can be added to it.

I have been using the df.query() method to check individual cases by hand, because it lets me search the master dataframe on multiple columns - something like df.query('surname == "JONES" and initials == "D V" and town == "LONDON"'). I want to do something like this, but building the query from specific columns of each row of the other dataframe.

I think I could work out how to do this using for loops and if statements, but that would be wildly inefficient and obviously not ideal. A list comprehension might be more efficient, but I can't work out the dataframe-comparison part unless I create a new column whose data is built from the values I want to compare (JONES-DV-LONDON, which seems wrong).
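For what it's worth, here is a minimal sketch of that concatenated-key idea, using the sample columns from the edit below and pandas' isin instead of explicit loops (just to illustrate what I mean - I'm not sure it's the right approach):

key_cols = ['surname', 'initials', 'town']

# build one comparison key per row in each dataframe
master_keys = df1[key_cols].agg('-'.join, axis=1)
other_keys = df2[key_cols].agg('-'.join, axis=1)

# rows of df2 whose key does not appear in the master dataframe
missing = df2[~other_keys.isin(master_keys)]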

I think there is an answer on here already, but it relies on the dataframes being more or less identical (which mine aren't - hence my wish to compare only certain columns).

I have been unable to find an example of someone doing the same thing, which may well be my own failing. I am a novice and have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...

EDIT - some sample data (not exactly what I'm using but hopefully helps show what I am trying to do)

df1 (my master list)
surname    initials    town
JONES      D V         LONDON
DAVIES     H G         BIRMINGHAM

df2
surname    initials    town
DAVIES     H G         BIRMINGHAM
HARRIS     P J         SOUTHAMPTON
JONES      D V         LONDON

I would like to identify the columns to use in the comparison (surname, initials and town here - assume there are more columns whose data cannot be matched) and then return the rows of df2 that do not appear in df1 - so in this case a dataframe containing:

surname    initials    town
HARRIS     P J         SOUTHAMPTON
gmh
  • please give some sample data and the desired result for it – Dmitriy K. Apr 13 '20 at 11:26
  • Done - it's not what I am actually using but hopefully shows what I am trying to do. There are more columns in my data which will not match, so I would like to specify which columns to compare. – gmh Apr 13 '20 at 12:55

1 Answer


Define the columns to join on:

cols = ['surname', 'initials', 'town']

Then you can use merge with the parameter indicator=True, which adds a column showing where each row appears (left_only, right_only or both):

df_res = df1.merge(df2, how='outer', on=cols, indicator=True)

and exclude the rows that appear in both dataframes:

df_res = df_res[df_res['_merge'] != 'both']
print(df_res)

    surname initials    town        _merge
2   HARRIS  P J         SOUTHAMPTON right_only

You can filter on left_only or right_only, depending on which dataframe's missing rows you want.
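For instance, to get back just the rows that exist in df2 but not in the master list (matching the desired output in the question), one way - a small sketch assuming the df_res from above - is to keep the right_only rows and drop the indicator column:

# keep only rows that came from df2 and tidy up the helper column
df2_only = df_res[df_res['_merge'] == 'right_only'].drop(columns='_merge')
print(df2_only)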

Dmitriy K.
  • This is great, thank you so much! I have learnt a lot applying this to my actual data – gmh Apr 15 '20 at 10:14