How to remove rows in a Pandas dataframe if there exits matches across a set of columns in another dataframe?

Question

I have a set of people in a dataframe, I need the list of people that do not occur in the main dataset. Currently I am checking against first and last name.

data_to_check_dataset is the input data that needs to be checked , it contains many columns but currently I only need to check against first_name , last_name.

	first_name	last_name	...
0	James	Apple	...
1	Alice	test	...
...	...	...	...
10000	Paul	test	...

sometimes the data fields can be entirely blank and are read as nan values.

	first_name	last_name	...
0	James comp	nan	...
1	Paul ltd	nan	...
...	...	...	...
10000	Paul other	nan	...

The dataframe I am checking against current_people_dataset : , it contains many columns I have renamed the name columns to first_name , last_name. Its null values are blank for some reason, I think because

	first_name	last_name	...
0	f_A	l_A	...
1	B		...
...	...	...	...
900000	paul	smith	...

The data_to_check_dataset is always smaller then the current_people_dataset. Column ordering is not fixed and can change depending on here the data is loaded in from.

currently I have been trying to adapt the code from here.

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))

This raises ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat error when comparing colunmns.

sorry I have no ideas why it delisted the answer , I did not delist it. — J_7567, Aug 12 '21 at 12:44

score 0 · Accepted Answer · answered Aug 11 '21 at 11:31

That's what error is saying so one way is to typecase first_name and last_name of both df's to string by using astype():

data_to_check_dataset[['first_name','last_name']]=data_to_check_dataset[['first_name','last_name']].astype(str)
current_people_dataset[['first_name','last_name']]=current_people_dataset[['first_name','last_name']].astype(str)

Finally chain replace() to your current method for converting string nan back to real NaN:

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer',on=['first_name','last_name'])
         .query('_merge=="left_only"')
         .drop('_merge', axis=1)
         .replace('nan',float('NaN'),regex=True))

How to remove rows in a Pandas dataframe if there exits matches across a set of columns in another dataframe?

1 Answers1