Create ID for duplicated of different column

Question

I tried a lot on this concept,

I have a data.frame with 3 column name Custname, Email, Mobile_no:

Custname    Email         Mobile_no
aaa        xyz@gmail.com    987
bbb        xyz@gmail.com    123
ccc        xyz0@gmail.com   987
ddd        abc@gmail.com    123
eee        xyz0@gmail.com   100

Actually these customers are in the same group because the mobile number and e-mail are the same across the column however different customer names.

I want to create a unique ID who are all having the same mobile's email match with email column and same email's mobile match with mobile number column.

I tried a lot by using a duplicated and paste function.

As an additional explanation, I don't want group_by customer, actually 987 came in two different customers(aaa and CCC) but that two different customers having different email xyz and xyz. In this one mail xyz matched with customer eee, so (aaa ccc eee) all are same group, like wise email.

Thank you for help

Can you show when the data is not in same group? Not clear. Do you need `df1 %>% group_by(Custname) %>% mutate(grp = n_distinct(paste(Email, Mobile_no)))` — akrun, Aug 29 '18 at 14:15
What did you try? There are many questions like this on stackoverflow. — S Rivero, Aug 29 '18 at 14:16
Possible duplicate of [Assign unique ID based on two columns](https://stackoverflow.com/questions/42921674/assign-unique-id-based-on-two-columns) — S Rivero, Aug 29 '18 at 14:17
@Akun, I don't want group_by customer, actually 987 came in two different customers(aaa and CCC) but that two different customers having different email xyz and xyz0 .in this one mail xyz0 matched with customer eee ,so( aaa ccc eee )all are same group .like wise email — Prabhu, Aug 29 '18 at 15:00
@Prabhu, could you provide in your question the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)." — Artem, Sep 22 '18 at 12:12

Artem · Answer 1 · 2018-09-23T19:27:02.217

In fact your problem is graph separation for components. In your case vertices of graphs are persons. Based on attribute information, i.e. e-mail and phone number, you can establish relationships which are edges.

It looks like simple methods like paste or duplicate or group_by are not effective as you can have rather complicated paths. As you explained however person D and person E have completely different contacts, in fact they are connected through person C hence should have the same ID.

Or in other words some person regesterd on site with e-mail A and mobile B. Then he lost the phone. And registered with mobile C. Then he forgot his password and registered with e-mail D. In the end we have the person with e-mail D and and mobile C. For some unknown reason he registered by different names.

You may have even more complicated relationship pathsway.

The algorithm below is using igraph to make an undirected graph based on adjacency matrix created on your condition. After it identifies not connected components, extract it and merge with initial data.frame. As there was not enough data in your example the simulation was used.

Simulated Input:

   name tel    email
1   AAA 222 F@xy.com
2   BBB 555 C@xy.com
3   CCC 333 E@xy.com
4   DDD 666 D@xy.com
5   EEE 666 A@xy.com
6   FFF 111 F@xy.com
7   GGG 444 B@xy.com
8   HHH 666 A@xy.com
9   III 444 B@xy.com
10  JJJ 333 F@xy.com

Code

library(igraph)
set.seed(123)
n <- 10

# simulation
df <- data.frame(
  name = sapply(1:n, function(i) paste0(rep(LETTERS[i], 3), collapse = "")),
  tel = sample(1:6, n, replace = TRUE) * 111,
  email = paste0(sample(LETTERS[1:6], n, replace = TRUE), "@xy.com")
)

# adjacency matrix preparation
df1 <- expand.grid(df$name, df$name)
names(df1) <- c("name_x", "name_y")

df1 <- merge(df1, df, by.x = "name_x", by.y = "name")
df1 <- merge(df1, df, by.x = "name_y", by.y = "name")
df1$con <- ifelse(with(df1, tel.x == tel.y | email.x == email.y), 1, 0)

stats::reshape(df1[, c(1, 2, 7)], idvar = "name_x", timevar = "con", direction = "wide")
#v.names = , timevar = "numbers", direction = "wide")
library(igraph)


library(reshape2)
m <- dcast(df1[, c(1, 2, 7)], name_y ~ name_x)
rownames(m) <- m[, 1]
m[, 1] <- NULL
m <- as.matrix(m) 
diag(m) <- 0

# graph creation
g1 <- graph_from_adjacency_matrix(m, mode = "undirected")
gcmps <- groups(components(g1))

# groups extraction
ids <- unlist(mapply(function(x, y) paste0(x, "_", y), seq_along(gcmps), gcmps))
df_ids <- as.data.frame(t(sapply(ids, function(x) unlist(strsplit(x, "_")))))
names(df_ids) <- c("id", "name")

# data merging
result <- merge(df, df_ids)
result

Output:

   name tel    email
1   AAA 222 F@xy.com
2   BBB 555 C@xy.com
3   CCC 333 E@xy.com
4   DDD 666 D@xy.com
5   EEE 666 A@xy.com
6   FFF 111 F@xy.com
7   GGG 444 B@xy.com
8   HHH 666 A@xy.com
9   III 444 B@xy.com
10  JJJ 333 F@xy.com

Relationship Graph (only first letters of name were taken)

Million Thanks Artem. its very useful and its working properly. Now your creating a id based on the name match,In my data sets name are not same ,may be the name spellings are different. So would please help to create id without a name with same logic. my data sets have only four column,Code,Tele,Email,account number. — Prabhu, Oct 01 '18 at 16:31
@Prabhu, my pleasure, actually you are asking another question in the comment. It is better to post a new question. I'd recommend you to follow [How to make a great reproducable example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Artem, Oct 02 '18 at 12:33

Create ID for duplicated of different column

1 Answers1