my database contains about 10 million customers. I have information on their email address, first and last name, address, zipcode and phone number. As in the example below the data quality is poor as many values are missing and there is no fixed column where I can use a group by statement in SQL or Pandas. I need to group very similar IDs into a single accountID in the first step.
ID | FirstName | LastName | Adress | Adress1 | Adress2 | Zipcode | Birthdate | Mobilephone | Phone | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Donald.duck@disney.com | Donald | Duck | 12345 | +4912345 | |||||
2 | Donald.duck@disney.com | Donald | Duck | Mainstreet 1 | main | 1 | +4912345 | |||
3 | duck.d@gmail.com | Donald | Duck | Main str 1 | main | 1 | 12345 | +5054321 |
I need something like a group by with a threshold, like "if more than 70 % of the not none values are equal, please combine them".
This is what I have tried already. It is working really good, but it takes like forever (see the graph at the end). So is there another way to achieve this faster? I am thinking about something like numpy, list comparisons, pyspark or do it somehow directly in mysql 5.x or mariadb. Any ideas?
def find_duplicates(df,Id,cid):
idx = df[df.inID == inId].index
tmp_mail = df.loc[idx].Email.values[0]
row = df.loc[idx.values[0], ['Email', 'FirstName', 'LastName', 'Adress1','Adress2', 'Zipcode', 'Birthdate', 'Mobilephone','Phone']].dropna()
if len(row)>3:
res_email = df.loc[df.inEmail == tmp_mail]
res_match = df[(df == row).sum(axis=1) >= round(len(row)) * 0.7]
res = res_email.append(res_match).drop_duplicates()
res['customer_id'] = cid
res['ctd_emails'] = res.Email.nunique()
res['ctd_inid'] = res.ID.nunique()
return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])
else:
res = df.loc[df.inEmail == tmp_mail]
res['customer_id'] = cid
res['ctd_emails'] = res.Email.nunique()
res['ctd_inid'] = res.ID.nunique()
return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])
ctd=0
while len(df)>0:
df,tmp = find_duplicates(df,df.loc[df.index.min(),'ID'],ctd)
if ctd == 0:
res = tmp
else:
res = res.append(tmp)
ctd += 1
I already did my research and tried stuff like described in this link e.g. How can I compare two lists in python and return matches
Thank you in advance