I have a set of people in a dataframe, and I need the list of people that do not occur in the main dataset. Currently I am checking against first and last name.
`data_to_check_dataset` is the input data that needs to be checked. It contains many columns, but currently I only need to check against `first_name` and `last_name`.
| | first_name | last_name | ... |
|---|---|---|---|
| 0 | James | Apple | ... |
| 1 | Alice | test | ... |
| ... | ... | ... | ... |
| 10000 | Paul | test | ... |
Sometimes a data field can be entirely blank, in which case its values are read as NaN.
| | first_name | last_name | ... |
|---|---|---|---|
| 0 | James comp | nan | ... |
| 1 | Paul ltd | nan | ... |
| ... | ... | ... | ... |
| 10000 | Paul other | nan | ... |
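To illustrate why this matters for the merge, here is a minimal sketch with made-up data (the frame name `sample` is hypothetical, not part of my real code). A column that contains nothing but NaN is stored as `float64` rather than as text:

```python
import numpy as np
import pandas as pd

# Made-up rows mirroring the table above: last_name is entirely missing.
sample = pd.DataFrame({
    "first_name": ["James comp", "Paul ltd"],
    "last_name": [np.nan, np.nan],
})

# An all-NaN column has no strings to infer from, so pandas stores it as float64.
last_name_dtype = str(sample["last_name"].dtype)
print(last_name_dtype)
```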
The dataframe I am checking against is `current_people_dataset`. It also contains many columns; I have renamed the name columns to `first_name` and `last_name`. Its null values are blank for some reason.
| | first_name | last_name | ... |
|---|---|---|---|
| 0 | f_A | l_A | ... |
| 1 | B | ... | |
| ... | ... | ... | ... |
| 900000 | paul | smith | ... |
`data_to_check_dataset` is always smaller than `current_people_dataset`.
Column ordering is not fixed and can change depending on where the data is loaded from.
Currently I have been trying to adapt the code from here:
new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))
This raises the following error when comparing columns:
`ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat`
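One possible way around the dtype mismatch, sketched under assumptions rather than offered as a confirmed fix: select the two key columns by name (so column order does not matter), convert both NaN and blank values to empty strings so the keys have the same type in both frames, and then do a left merge with `indicator=True`. The names `KEYS`, `normalise_keys` and `find_new_people` are hypothetical helpers, and the sample frames are made up.

```python
import numpy as np
import pandas as pd

KEYS = ["first_name", "last_name"]  # assumed key columns


def normalise_keys(df, keys=KEYS):
    """Return only the key columns as stripped strings, with NaN turned into ''."""
    keys_only = df[keys].astype(object)
    keys_only = keys_only.where(keys_only.notna(), "")
    return keys_only.apply(lambda col: col.astype(str).str.strip())


def find_new_people(to_check, current, keys=KEYS):
    """Rows of `to_check` whose key pair does not occur in `current`."""
    left = normalise_keys(to_check, keys)
    # De-duplicate the right side so the left merge keeps one row per input row.
    right = normalise_keys(current, keys).drop_duplicates()
    merged = left.merge(right, on=keys, how="left", indicator=True)
    is_new = (merged["_merge"] == "left_only").to_numpy()
    return to_check.loc[is_new]


# Made-up data: NaN on one side, blank strings and a different column order on the other.
data_to_check_dataset = pd.DataFrame({
    "first_name": ["James comp", "paul"],
    "last_name": [np.nan, np.nan],
})
current_people_dataset = pd.DataFrame({
    "last_name": ["", "smith"],
    "first_name": ["paul", "paul"],
})

new_people_names = find_new_people(data_to_check_dataset, current_people_dataset)
print(new_people_names)
```

Note that this sketch treats a missing value and a blank value as the same key, and that it compares names case-sensitively, so `Paul` and `paul` would count as different people unless the keys are also lower-cased.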