Group rows with mixed data types and similarity threshold

Question

my database contains about 10 million customers. I have information on their email address, first and last name, address, zipcode and phone number. As in the example below the data quality is poor as many values are missing and there is no fixed column where I can use a group by statement in SQL or Pandas. I need to group very similar IDs into a single accountID in the first step.

ID	Email	FirstName	LastName	Adress	Adress1	Adress2	Zipcode	Mobilephone	Phone
1	Donald.duck@disney.com	Donald	Duck				12345		+4912345
2	Donald.duck@disney.com	Donald	Duck	Mainstreet 1	main	1			+4912345
3	duck.d@gmail.com	Donald	Duck	Main str 1	main	1	12345	+5054321

I need something like a group by with a threshold, like "if more than 70 % of the not none values are equal, please combine them".

This is what I have tried already. It is working really good, but it takes like forever (see the graph at the end). So is there another way to achieve this faster? I am thinking about something like numpy, list comparisons, pyspark or do it somehow directly in mysql 5.x or mariadb. Any ideas?

def find_duplicates(df,Id,cid):
  idx = df[df.inID == inId].index
  tmp_mail = df.loc[idx].Email.values[0]
  row = df.loc[idx.values[0], ['Email', 'FirstName', 'LastName', 'Adress1','Adress2', 'Zipcode', 'Birthdate', 'Mobilephone','Phone']].dropna()
  if len(row)>3:
      res_email = df.loc[df.inEmail == tmp_mail]
      res_match = df[(df == row).sum(axis=1) >= round(len(row)) * 0.7]
      res = res_email.append(res_match).drop_duplicates()
      res['customer_id'] = cid
      res['ctd_emails'] = res.Email.nunique()
      res['ctd_inid'] = res.ID.nunique()
      return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])
  else:
      res = df.loc[df.inEmail == tmp_mail]
      res['customer_id'] = cid
      res['ctd_emails'] = res.Email.nunique()
      res['ctd_inid'] = res.ID.nunique()
      return df.drop(index=res.index), res.drop(columns = ['Adress1','Adress2'])

ctd=0
while len(df)>0:
    df,tmp = find_duplicates(df,df.loc[df.index.min(),'ID'],ctd)
    if ctd == 0:
        res = tmp
    else:
        res = res.append(tmp)
    ctd += 1

I already did my research and tried stuff like described in this link e.g. How can I compare two lists in python and return matches

Thank you in advance

Far to huge a subject for a simp[le answer here. There are companies that do just this and nothing else. They charge a pretty penny because the process is far from simple and usually need a human to eyeball a lot of the potential amendment identified by any algorithm — RiggsFolly, Aug 26 '21 at 16:53
You may want to look into this function, if you're considering using pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.combine_first.html . You may want to consider using some variant of `COALESCE` if you're using SQL. I agree with the prior comment -- this is not simple. You should carefully and clearly define your criteria for merging rows. — Mose Wintner, Aug 27 '21 at 00:14

Group rows with mixed data types and similarity threshold

0 Answers0