0

my database contains about 10 million customers. I have information on their email address, first and last name, address, zipcode and phone number. As in the example below the data quality is poor as many values are missing and there is no fixed column where I can use a group by statement in SQL or Pandas. I need to group very similar IDs into a single accountID in the first step.

ID Email FirstName LastName Adress Adress1 Adress2 Zipcode Birthdate Mobilephone Phone
1 Donald.duck@disney.com Donald Duck 12345 +4912345
2 Donald.duck@disney.com Donald Duck Mainstreet 1 main 1 +4912345
3 duck.d@gmail.com Donald Duck Main str 1 main 1 12345 +5054321

I need something like a group by with a threshold, like "if more than 70 % of the not none values are equal, please combine them".

This is what I have tried already. It is working really good, but it takes like forever (see the graph at the end). So is there another way to achieve this faster? I am thinking about something like numpy, list comparisons, pyspark or do it somehow directly in mysql 5.x or mariadb. Any ideas?

def find_duplicates(df,Id,cid):
  idx = df[df.inID == inId].index
  tmp_mail = df.loc[idx].Email.values[0]
  row = df.loc[idx.values[0], ['Email', 'FirstName', 'LastName', 'Adress1','Adress2', 'Zipcode', 'Birthdate', 'Mobilephone','Phone']].dropna()
  if len(row)>3:
      res_email = df.loc[df.inEmail == tmp_mail]
      res_match = df[(df == row).sum(axis=1) >= round(len(row)) * 0.7]
      res = res_email.append(res_match).drop_duplicates()
      res['customer_id'] = cid
      res['ctd_emails'] = res.Email.nunique()
      res['ctd_inid'] = res.ID.nunique()
      return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])
  else:
      res = df.loc[df.inEmail == tmp_mail]
      res['customer_id'] = cid
      res['ctd_emails'] = res.Email.nunique()
      res['ctd_inid'] = res.ID.nunique()
      return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])

ctd=0
while len(df)>0:
    df,tmp = find_duplicates(df,df.loc[df.index.min(),'ID'],ctd)
    if ctd == 0:
        res = tmp
    else:
        res = res.append(tmp)
    ctd += 1

I already did my research and tried stuff like described in this link e.g. How can I compare two lists in python and return matches

Execution time

Thank you in advance

pscl
  • 15
  • 1
  • 4
  • Far to huge a subject for a simp[le answer here. There are companies that do just this and nothing else. They charge a pretty penny because the process is far from simple and usually need a human to eyeball a lot of the potential amendment identified by any algorithm – RiggsFolly Aug 26 '21 at 16:53
  • You may want to look into this function, if you're considering using pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.combine_first.html . You may want to consider using some variant of `COALESCE` if you're using SQL. I agree with the prior comment -- this is not simple. You should carefully and clearly define your criteria for merging rows. – Mose Wintner Aug 27 '21 at 00:14

0 Answers0