I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable but not all. The rows represent individuals and the the columns contain data about those individuals. I want to check by row certain column values of one dataframe against the 'master' dataframe and then return the omissions so these can be added to it.
I have been using the df.query()
method to check individual cases using my own inputs because I can search the master dataframe using multiple columns - so, something like df.query('surname == "JONES" and initials == "D V" and town == "LONDON"')
. I want to do something like this but by creating a query of each row of the other dataframe using data from specific columns.
I think I can work out how I might do this using for loops and if statements but that's going to be wildly inefficient and obviously not ideal. List comprehension might be more efficient but I can't work out the dataframe comparison part unless I create a new column whose data is built from the values I want to compare (JONES-DV-LONDON, but that seems wrong).
There is an answer in here I think but it relies on the dataframes being more or less identical (which mine aren't - hence my wish to compare only certain columns).
I have been unable to find an example of someone doing the same, which might be my failure again. I am a novice and I have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...
EDIT - some sample data (not exactly what I'm using but hopefully helps show what I am trying to do)
df1 (my master list)
surname initials town
JONES D V LONDON
DAVIES H G BIRMINGHAM
df2
surname initials town
DAVIES H G BIRMINGHAM
HARRIS P J SOUTHAMPTON
JONES D V LONDON
I would like to identify the columns to use in the comparison (surname, initials, town here - assume there are more with data that cannot be matched) and then return the unique results from df2 - so in this case a dataframe containing:
surname initials town
HARRIS P J SOUTHAMPTON