0

I have customer transaction data over a period of time, where they have used multiple emails and phone number over the time. How I can associate all emails and phone numbers to one ID. My current dataframe is in following format

Name               Phone        Email
Ram                123456788    ram@gmail.com
Ram Fernandes      123456788    r1@live.com
Ram F              213456789    r1@live.com
Ram Fern           213456789    ram@msft.com
Matthews           123456798    Matt@msft.com

Output should look like

ID      Name
Cust 1  Ram
Cust 1  Ram Fernandes
Cust 1  Ram F
Cust 1  Ram Fern
Cust 2  Matthews

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Pulkit
  • 3
  • 2
  • What rules are you applying to define a single customer ID? – Joris C. Jun 30 '19 at 15:12
  • Hi! Please see https://stackoverflow.com/questions/42921674/assign-unique-id-based-on-two-columns. I believe it accomplishes what you need ^_^ – OctoCatKnows Jun 30 '19 at 15:36
  • @JorisChau In this case Record 1 and 2 have same phone different email, so both are same customer. Between Record 2 and 3, same email, so now 1,2 and 3 are same customer. Now 3 and 4 have number. so now all 1,2,3 and 4 are same customer. – Pulkit Jun 30 '19 at 16:51
  • @BuffsGrad16 its different as student value never changes value for same id in that example. it can vary in this example – Pulkit Jun 30 '19 at 16:53
  • 2
    I think your problem is, in essence, very similar to the one described [here](https://stackoverflow.com/q/56740990/5793905). Maybe you can adapt that graph-based approach. – Alexis Jun 30 '19 at 17:08
  • @Alexis Thanks a lot. Indeed its very similar. – Pulkit Jun 30 '19 at 20:28

1 Answers1

0

We can extract the first part of email before the word "@" to get the firstName and then use for loop to see if any of the previous entry has the same firstName or Phone. If there is match use that ID or increment the ID variable.

df$ID <- "Cust 1"
ind <- 1
df$firstName <- sub("(.*)@.*", "\\1", df$Email)

for (i in 2:nrow(df)) {
   if (with(df, is.na(Phone[i]) | Phone[i] == "" | Phone[i] == 0 | 
      is.na(firstName[i]) | firstName[i] == ""))
       df$ID[i] <- NA
   else {
   #check if current Phone or Firstname is similar to any of the previous entry
     inds <- with(df, Phone[i] == Phone[1:(i-1)] | 
                      firstName[i] == firstName[1:(i-1)])
     if (any(inds)) 
     #Get the ID of similar entry 
       df$ID[i] <- df$ID[which.max(inds)]
     else {
     #If there is no match then give a new ID
       ind = ind + 1
       df$ID[i] <- paste("Cust", ind)
     }
   }
}


df
#          Name     Phone         Email firstName     ID
#1          Ram 123456788 ram@gmail.com       ram Cust 1
#2 RamFernandes 123456788   r1@live.com        r1 Cust 1
#3         RamF 213456789   r1@live.com        r1 Cust 1
#4      RamFern 213456789  ram@msft.com       ram Cust 1
#5     Matthews 123456798 Matt@msft.com      Matt Cust 2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • first part of the email doesnt necessary have to be same name, it could be ram@gmail.com or r123@gmail.com – Pulkit Jun 30 '19 at 13:21
  • @Pulkit then how do you plan to identify similar names? – Ronak Shah Jun 30 '19 at 13:22
  • I want to associate them like record 1 and 2, phone number is same but different email. Now for record 2 and 3 email is same and so on similarly for record 4. So all 1,2,3,4 record get same ID. I'm not concerned about getting the name correct, only give them an id – Pulkit Jun 30 '19 at 13:26
  • @Pulkit I see. I have updated the answer accordingly. Can you check if it works for your case now? – Ronak Shah Jul 01 '19 at 03:08
  • it works great unless I have blanks or 0 in email or phone number, I would like to remove them, probably another nested if else might help – Pulkit Jul 01 '19 at 18:16
  • @Pulkit okay..I added another check and replace the `ID`s with `NA` for those values. If not needed you can remove them later. – Ronak Shah Jul 02 '19 at 03:03