Your problem is in fact one of finding the connected components of a graph. In your case the vertices of the graph are persons. Based on attribute information, i.e. e-mail and phone number, you can establish relationships between them, which are the edges.
Simple methods like paste, duplicated or group_by are not effective here, because the paths between records can be rather complicated. As you explained, person D and person E have completely different contacts, yet they are connected through person C and hence should have the same ID.
In other words: some person registered on the site with e-mail A and mobile B. Then he lost the phone and registered with mobile C. Then he forgot his password and registered with e-mail D. In the end we have a person with e-mail D and mobile C. For some unknown reason he registered under different names.
You may have even more complicated relationship paths.
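To make the chain concrete, here is a minimal sketch with three hypothetical registrations (the names and contact values are made up for illustration and are not from your data). Instead of comparing persons pairwise, it links every record to its own e-mail and phone number, so records that share any value, directly or through a chain, fall into the same component.

```r
library(igraph)

# Three hypothetical registrations of the same person (illustrative values only)
regs <- data.frame(
  name  = c("Ann", "Anna", "A. Smith"),
  email = c("A@xy.com", "A@xy.com", "D@xy.com"),
  tel   = c("111", "222", "222"),
  stringsAsFactors = FALSE
)

# Edge list: every record is linked to its e-mail and to its phone number.
# The prefixes keep e-mail and phone vertices apart from record vertices.
edges <- rbind(
  data.frame(from = regs$name, to = paste0("email:", regs$email)),
  data.frame(from = regs$name, to = paste0("tel:", regs$tel))
)

g <- graph_from_data_frame(edges, directed = FALSE)
memb <- components(g)$membership  # named vector: vertex name -> component

# Component number of each record. "Ann" and "A. Smith" share neither
# e-mail nor phone, but both are linked to "Anna".
regs$id <- unname(memb[regs$name])
regs
```

All three rows receive the same id, which is the behaviour you asked for in the D / C / E case.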
The algorithm below uses igraph to build an undirected graph from an adjacency matrix created according to your condition. It then identifies the connected components (groups that are not connected to each other), extracts them and merges them with the initial data.frame. As there was not enough data in your example, simulated data is used.
Simulated Input:
name tel email
1 AAA 222 F@xy.com
2 BBB 555 C@xy.com
3 CCC 333 E@xy.com
4 DDD 666 D@xy.com
5 EEE 666 A@xy.com
6 FFF 111 F@xy.com
7 GGG 444 B@xy.com
8 HHH 666 A@xy.com
9 III 444 B@xy.com
10 JJJ 333 F@xy.com
Code
library(igraph)
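set.seed(123) # note: R >= 3.6.0 changed the default sample() algorithm, so simulated values may differ from the table above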
n <- 10
# simulation
df <- data.frame(
name = sapply(1:n, function(i) paste0(rep(LETTERS[i], 3), collapse = "")),
tel = sample(1:6, n, replace = TRUE) * 111,
email = paste0(sample(LETTERS[1:6], n, replace = TRUE), "@xy.com")
)
# adjacency matrix preparation
df1 <- expand.grid(df$name, df$name)
names(df1) <- c("name_x", "name_y")
df1 <- merge(df1, df, by.x = "name_x", by.y = "name")
df1 <- merge(df1, df, by.x = "name_y", by.y = "name")
df1$con <- ifelse(with(df1, tel.x == tel.y | email.x == email.y), 1, 0)
# reshape the long list of pairs into a wide name-by-name matrix
library(reshape2)
m <- dcast(df1[, c("name_y", "name_x", "con")], name_y ~ name_x, value.var = "con")
rownames(m) <- m[, 1]
m[, 1] <- NULL
m <- as.matrix(m)
diag(m) <- 0
# graph creation
g1 <- graph_from_adjacency_matrix(m, mode = "undirected")
gcmps <- groups(components(g1))
# groups extraction
ids <- unlist(mapply(function(x, y) paste0(x, "_", y), seq_along(gcmps), gcmps))
df_ids <- as.data.frame(t(sapply(ids, function(x) unlist(strsplit(x, "_")))))
names(df_ids) <- c("id", "name")
# data merging
result <- merge(df, df_ids)
result
Output:
name tel email id
1 AAA 222 F@xy.com 1
2 BBB 555 C@xy.com 2
3 CCC 333 E@xy.com 1
4 DDD 666 D@xy.com 3
5 EEE 666 A@xy.com 3
6 FFF 111 F@xy.com 1
7 GGG 444 B@xy.com 4
8 HHH 666 A@xy.com 3
9 III 444 B@xy.com 4
10 JJJ 333 F@xy.com 1
Relationship Graph (only first letters of name were taken)
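As a more compact variant (a sketch, not the code above), the adjacency matrix can be built with outer() and the group number read directly from the membership vector returned by components(). The data frame below is the simulated input typed in by hand, so the snippet does not depend on the random seed; the last two lines show one way to draw the graph with only the first letter of each name as the label.

```r
library(igraph)

# Simulated input from the table above, entered by hand
df <- data.frame(
  name  = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH", "III", "JJJ"),
  tel   = c(222, 555, 333, 666, 666, 111, 444, 666, 444, 333),
  email = paste0(c("F", "C", "E", "D", "A", "F", "B", "A", "B", "F"), "@xy.com"),
  stringsAsFactors = FALSE
)

# Adjacency matrix: 1 where two persons share a phone number or an e-mail
m <- (outer(df$tel, df$tel, "==") | outer(df$email, df$email, "==")) * 1
dimnames(m) <- list(df$name, df$name)
diag(m) <- 0  # a person is not linked to himself

g <- graph_from_adjacency_matrix(m, mode = "undirected")

# membership is a named vector: vertex name -> component number
df$id <- unname(components(g)$membership[df$name])
df

# Plot with the first letter of each name as the vertex label
V(g)$label <- substr(V(g)$name, 1, 1)
plot(g)
```

Note that outer() still creates an n-by-n matrix, so for large data an edge list between persons and their contact values may be the better choice.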
