0

enter image description hereI have over 900,000 records of contacts and need to find a way to merge the records that have the same 'First Name' AND 'Last Name' AND either the same 'Street Address', 'Email', or 'Phone'. I want to keep the record with the most recent 'Last Modified Date' as the main record, and all the duplicates update their 'Contact ID' to the one of the main record. I also want to keep the original contact ID to track in a new column! Of course, I'd like to ignore Nulls when merging. My code is pretty long, so here is a link to download it (I don't want to take up too much space here): https://drive.google.com/file/d/1tJRr4IyHwHf2NF80j1LO7bTDakXRUtVM/view?usp=sharing

lsignori
  • 15
  • 4
  • 1
    If you could it would be helpful to supply the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) It's useful to allow quicker testing and to highlight your current misunderstanding in a more direct way instead of being vague about which issues you're having. – Liam Aug 03 '21 at 00:19
  • You may use `groupby` and a couple of aggregate functions, as in here, perhaps: [https://stackoverflow.com/questions/36271413/pandas-merge-nearly-duplicate-rows-based-on-column-value](https://stackoverflow.com/questions/36271413/pandas-merge-nearly-duplicate-rows-based-on-column-value) – massimopinto Aug 03 '21 at 00:26

2 Answers2

1

The idea is to sort the dataframe by modified date and then iterate through the rows saving the combination as a string in a dictionary as a key.

dataframe.sort(by = ['Last Modified Date'], ascending = False)
dataframe['old contact id'] = ""
dict1 = {}
for i in range(len(dataframe)):
    if dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])]]['old contact id'] = df['contact id'][i]
    elif dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])]]['old contact id'] = df['contact id'][i]
    elif dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])]]['old contact id'] = df['contact id'][i]
    else:
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])] = = str(dataframe['Contact id'][i]
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])] = = str(dataframe['Contact id'][i]
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])] = str(dataframe['Contact id'][i]

At last if the old modified is empty, delete the column.

dataframe = dataframe.loc[dataframe['old contact id'] != ""]

The code is long but after sorting it gives O(n) complexity.

0

Having rejected the first answer I posted and then updating the question with more requirements, please do note: This site is not a free code writing service. And the link to your code doesn't work (at least at the moment).

Given:

import pandas as pd
import numpy as np

df = pd.DataFrame({'First_Name': {0: 'Greg',  1: 'Greg',  2: 'John',  3: 'John',  4: 'Ryan',  5: 'Ryan'}, \
                    'Last_Name': {0: 'Li', 1: 'Li', 2: 'Doe', 3: 'Doe', 4: 'Lin', 5: 'Lin'}, \
                    'ContactID': {0: 123, 1: 1877, 2: 566, 3: 234, 4: 789, 5: 52}, \
                    'Last_Modified_Date': {0: '2021-04-08',  1: '2019-05-06',  2: '2018-02-03', \
                                           3: '2014-05-07',  4: '2019-06-07',  5: '2018-06-07'}, \
                    'Email': {0: 'grey.li@gmail.com',  1: 'grey.li@gmail.com',  2: 'Johndeo@yahoo.com',  \
                              3: 'Johndeo@aol.net',  4: 'lin@hotmail.com',  5: np.nan}, \
                    'Address': {0: '44 Sherman',  1: np.nan,  2: '87 Branch Ave',  3: '87 Branch Ave', \
                                4: '84 Newport',  5: np.nan}, 'Phone': {0: '999-999-9999',  1: np.nan, \
                                                                     2: '890-523-4667',  3: np.nan,  4: \
                                                                     '678-900-000',  5: '678-900-000'}})

print(df)

enter image description here

Try:

df['Last_Modified_Date'] = pd.to_datetime(df['Last_Modified_Date'], format='%Y-%m-%d')
df = df.sort_values(by='Last_Modified_Date')
df['AllContactID'] = df['ContactID'].map(str)
df = df.replace(np.nan, '', regex=False)
df = df.groupby(by=['First_Name', 'Last_Name'], as_index=False)\
    .agg({'Last_Modified_Date': 'last', 'ContactID' : 'last', \
          'Email' : ', '.join, 'Address' : ', '.join, 'Phone' : ', '.join, 'AllContactID' : ', '.join})  
df = df.replace(r'(.*?)(,\s)\1', r', \1', regex=True)
df = df.replace(r'^, (.*)$', r'\1', regex=True)
df = df.replace(r', $', r'', regex=True)
#df['AllContactID'] = df.AllContactID.apply(lambda x: list(x.split(', ')))
#df['AllContactID'] = df.AllContactID.apply(lambda x: list(map(int, x)))

print(df)

enter image description here

MDR
  • 2,610
  • 1
  • 8
  • 18