2

I have a set of people in a dataframe, I need the list of people that do not occur in the main dataset. Currently I am checking against first and last name.

data_to_check_dataset is the input data that needs to be checked , it contains many columns but currently I only need to check against first_name , last_name.

first_name last_name ...
0 James Apple ...
1 Alice test ...
... ... ... ...
10000 Paul test ...

sometimes the data fields can be entirely blank and are read as nan values.

first_name last_name ...
0 James comp nan ...
1 Paul ltd nan ...
... ... ... ...
10000 Paul other nan ...

The dataframe I am checking against current_people_dataset : , it contains many columns I have renamed the name columns to first_name , last_name. Its null values are blank for some reason, I think because

first_name last_name ...
0 f_A l_A ...
1 B ...
... ... ... ...
900000 paul smith ...

The data_to_check_dataset is always smaller then the current_people_dataset. Column ordering is not fixed and can change depending on here the data is loaded in from.

currently I have been trying to adapt the code from here.

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))

This raises ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat error when comparing colunmns.

J_7567
  • 31
  • 3

1 Answers1

0

That's what error is saying so one way is to typecase first_name and last_name of both df's to string by using astype():

data_to_check_dataset[['first_name','last_name']]=data_to_check_dataset[['first_name','last_name']].astype(str)
current_people_dataset[['first_name','last_name']]=current_people_dataset[['first_name','last_name']].astype(str)

Finally chain replace() to your current method for converting string nan back to real NaN:

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer',on=['first_name','last_name'])
         .query('_merge=="left_only"')
         .drop('_merge', axis=1)
         .replace('nan',float('NaN'),regex=True))
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41