I have several hundred project files in excel all containing client contact information with various levels of completeness. I've crawled all the files and extracted all the data into a single data frame.
I'm having trouble figuring out a method to combine the partial entries into a single complete entry. I've included a sample dataframe below that should capture most cases.
d = {
'contact_name':['doug dougson',np.nan,'Doug','doug'],
'email_address':['doug@email.com','doug@email.com','doug@email.com',np.nan],
'company_name':['dougs company','np.nan','dougs company',np.nan],
'contact_number':[np.nan,1234567890,np.nan,1234567890]
}
f = pd.DataFrame(data=d,dtype='object')
My desired output is:
c = {
'contact_name':['doug dougson'],
'email_address':['doug@email.com'],
'company_name':['dougs company'],
'contact_number':[1234567890]
}
print(pd.DataFrame(data=c))
I would say 90% of the time email addresses will be available but in the rare case I'll have other client information but no email address.
my end goal is to reduce the 500+ partially complete entries into a dataframe with only the unique entries with the most complete information possible.
to complete the entries I assumed I would inner join on email address but I'm having trouble getting the functionality right with df.merge().
TIA