I have several hundred Excel project files, all containing client contact information with varying levels of completeness. I've crawled all the files and extracted the data into a single dataframe.

I'm having trouble figuring out a method to combine the partial entries into a single complete entry. I've included a sample dataframe below that should capture most cases.

import numpy as np
import pandas as pd

# four partial records for the same contact
d = {
    'contact_name': ['doug dougson', np.nan, 'Doug', 'doug'],
    'email_address': ['doug@email.com', 'doug@email.com', 'doug@email.com', np.nan],
    'company_name': ['dougs company', np.nan, 'dougs company', np.nan],
    'contact_number': [np.nan, 1234567890, np.nan, 1234567890]
    }

f = pd.DataFrame(data=d, dtype='object')

My desired output is:

c = {
    'contact_name':['doug dougson'],
    'email_address':['doug@email.com'],
    'company_name':['dougs company'],
    'contact_number':[1234567890]
    }

print(pd.DataFrame(data=c))

I would say 90% of the time an email address will be available, but in rare cases I'll have other client information and no email address.

My end goal is to reduce the 500+ partially complete entries to a dataframe containing only the unique entries, each with the most complete information possible.

To complete the entries, I assumed I would inner join on email address, but I'm having trouble getting the functionality right with df.merge().

TIA

  • Are `email_address` and `contact_number` guaranteed to only have one value other than `NaN`? If not, what do you want to do in that situation? What is your method for choosing `doug dougson` out of `['doug dougson',np.nan,'Doug','doug']`? – Nick Mar 01 '23 at 22:23
  • Hey Nick, in a given row there will only ever be one value per column, but it is possible (rarely) that doug dougson has different email addresses or contact numbers assigned to him in different rows. I planned on using the email_address as the primary key and matching/building the entries around it. doug dougson would be the optimal choice, but I haven't thought that far about how to solve that problem yet. My initial thought was to select the most frequent non-null value (a rough sketch of that idea follows these comments). – canconfirm24 Mar 01 '23 at 22:40
  • On further thought, in order to identify the most complete name I would `df[['first_name','last_name']] = df.contact_name.str.split(expand=True)`. Then, to identify the most complete entries, I would create a helper column counting the non-null values: `df.notnull().sum(axis=1)`. From there I can use `df['helper2'] = df.groupby('email_address')['helper'].transform(max)` to get the maximum non-null count for a given email address, and then subset only the most complete entries with `max_df = df[df['helper'] == df['helper2']]`. I'm kind of lost on how to fill in missing values from here – canconfirm24 Mar 02 '23 at 15:41
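
A rough sketch of the "select the most frequent non-null value" idea from the comments (just an illustration, assuming the sample frame f and email_address as the grouping key; ties simply fall back to the sorted order that Series.mode() returns):

def most_frequent(s):
    # most common non-null value in the column; an all-NaN column stays NaN
    s = s.dropna()
    return s.mode().iloc[0] if not s.empty else np.nan

# note: rows with a missing email_address are dropped by groupby's default dropna=True
consolidated = f.groupby('email_address').agg(most_frequent).reset_index()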

1 Answer

I think I've managed to solve this. I referenced this question (Pandas fill missing values in dataframe from another dataframe) to help work it out.

First, I wanted to break out the names into first and last name columns for the full dataframe:

df[['first_name','last_name']] = df.contact_name.str.split(expand=True)
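
One caveat, an assumption on my part since the question only shows two-part names: if any contact_name contains a middle name, split(expand=True) returns three columns and the two-column assignment above raises. Limiting the split to the first space avoids that:

# split on the first space only, so anything after it lands in last_name
df[['first_name','last_name']] = df.contact_name.str.split(n=1, expand=True)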

Next, I needed a helper column to identify the most complete rows, by summing the notnull() values across each row:

df['helper'] = df.notnull().sum(axis=1)

This next step is a bit unnecessary and could have been folded into the line that creates the new dataframe, but I liked having the extra column as a visual check. It gives each row the maximum helper count among rows sharing its email address:

df['helper2'] = df.groupby('email_address')['helper'].transform('max')

Now create a dataframe containing only the rows whose helper count equals the group maximum (i.e. the most complete entries) and drop duplicates:

max_df = df[df['helper'] == df['helper2']]
max_df = max_df.drop_duplicates()

Now it's time to start filling in missing values, but first we need to set the index on both dataframes to 'email_address', which allows us to match and combine on the common index:

df = df.set_index('email_address')
max_df = max_df.set_index('email_address')

Finally, fill in the missing values and drop any duplicates that remain:

max_df.combine_first(df).drop_duplicates()

It's not pretty, but I think this is the most complete result I could produce. If anyone has any improvements or suggestions, I would definitely like to build on and improve this.
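
Since the answer asks for suggestions, here is one possible simplification, sketched under the assumption that email_address is the key (not a tested drop-in replacement for the steps above): GroupBy.first() returns the first non-null value per column within each group, so a single groupby collapses the partial rows. It keeps the first non-null value rather than the most complete one, so the helper-column ranking is still useful if you care which variant of a name wins.

# collapse partial rows by email address, taking the first non-null value per column;
# dropna=False (pandas >= 1.1) keeps the rows that have no email address at all
consolidated = (
    f.groupby('email_address', dropna=False)
     .first()
     .reset_index()
)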