12

I have the following tibble,

contact <- tribble(
  ~name, ~phone, ~email,
  'John', 123, 'john_abc@gmail.com',
  'John', 456, 'john_abc@gmail.com',
  'John', 456, 'john_xyz@gmail.com',
  'John', 789, 'john_pqr@gmail.com'
)

I'd like to combine the phone numbers and emails if phone or email are the same, the desired output is the following,

contact_combined <- tribble(
  ~name, ~phone, ~email,
  'John', '123;456', 'john_abc@gmail.com;john_xyz@gmail.com',
  'John', '789', 'john_pqr@gmail.com'
)

I've tried grouping it first by name and phone and then by name and emails but it's not giving me the expected results. I'm stuck on finding an algorithmic way to solve this problem, could someone please give me an advice?

Note: The collapsing of the values in a column is not the question here. It's about selecting the records for the collapsing.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
msunij
  • 309
  • 2
  • 8
  • 1
    Create a grouping variable like this: [Make a group_indices based on several columns](https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns). Then [this](https://stackoverflow.com/questions/15933958/collapse-concatenate-aggregate-a-column-to-a-single-comma-separated-string-w) – Henrik Feb 17 '23 at 22:06
  • I.e. `grp = components(graph_from_data_frame(contact[ , c(2, 3, 1)]))$membership`; `aggregate(. ~ grp[contact$email], function(x) toString(unique(x)), data = contact)`, as described in the links above. – Henrik Feb 18 '23 at 22:13

4 Answers4

10

Graphs can help with this.

library(igraph)

# creates a matrix which tells whether pairs of vector elements are equal or not
equal_mat <- function(x) {
  
  outer(x, x, '==')
}

m.adj <- equal_mat(contact$phone) | equal_mat(contact$email)
g <- graph_from_adjacency_matrix(m.adj, mode='undir')

t(sapply(split(contact, components(g)$membership), function(group)
  sapply(group, function(column)
    paste(sort(unique(column)), collapse=';')))) %>%
  as_tibble()

# # A tibble: 2 × 3
#   name  phone   email                                
#   <chr> <chr>   <chr>                                
# 1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2 John  789     john_pqr@gmail.com                   

You can think of your original contacts as a graph, i.e. a set of vertices, one for each row in contact, which are connected by edges if two contacts have the same phone number or email. In your case the graph looks like this, plot(g): Graph of contacts

Contacts 1–3 form one connected component, while the contact number 4 which has no connections is another component. Each such component should be merged into one contact in the final output.

We create the graph from an adjacency matrix m.adj that tells which vertices (nodes) are connected and the graph components are identified using

components(g)$membership
[1] 1 1 1 2

which tells us exactly what we saw above: contacts 1–3 form component one, contact number 4 is component 2. Now we can just collapse the values within each components.

Robert Hacken
  • 3,878
  • 1
  • 13
  • 15
8

I guess igraph would be a good start (by which you can use decompose to cluster connected subgroups)

contact %>%
  select(c(2, 3, 1)) %>%
  graph_from_data_frame() %>%
  decompose() %>%
  lapply(function(x) {
    aggregate(
      . ~ name, get.data.frame(x),
      function(v) toString(unique(v))
    )
  }) %>%
  bind_rows() %>%
  setNames(names(contact))

which gives

  name    phone                                  email
1 John 123, 456 john_abc@gmail.com, john_xyz@gmail.com
2 John      789                     john_pqr@gmail.com

A more tidyverse way (thank @akrun's comment)

contact %>%
  relocate(name, .after = last_col()) %>%
  graph_from_data_frame() %>%
  decompose() %>%
  map(~ .x %>%
    get.data.frame() %>%
    reframe(across(everything(), ~ str_c(unique(.x), collapse = ";")), .by = "name")) %>%
  list_rbind() %>%
  setNames(names(contact))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • 1
    @akrun Aha, amazing! Seems there is still a long path ahead of me for learning `tidyverse`, which looks so powerful! – ThomasIsCoding Feb 17 '23 at 21:04
  • `grp = components(graph_from_data_frame(contact[ , c(2, 3, 1)]))$membership`; `aggregate(. ~ grp[contact$email], function(x) toString(unique(x)), data = contact)` – Henrik Feb 17 '23 at 22:05
4

here is s data.table approach

setDT(contact)
# set keys
setkey(contact, name, phone, email)
# self join on each unique key, filter and summarise on the fly 
ans <- contact[contact, c("phone2", "email2") := {
  temp <- contact[ name == i.name & 
                     (phone %in% contact[name == i.name & email == i.email, ]$phone | 
                        email %in% contact[name == i.name & phone == i.phone, ]$email), ]
  email_temp <- paste0(unique(temp$email), collapse = ";")
  phone_temp <- paste0(unique(temp$phone), collapse = ";")
  list(phone_temp, email_temp)
}, by = .EACHI]
# final step
unique(ans, by = c("name", "phone2", "email2"))[, .(name, phone = phone2, email = email2)]
#    name   phone                                 email
# 1: John 123;456 john_abc@gmail.com;john_xyz@gmail.com
# 2: John     789                    john_pqr@gmail.com

explanation

# so, for the first row, the variable 'temp' is calculated as follows
contact[ name == 'John' &
          (phone %in% contact[name == 'John' & email == 'john_abc@gmail.com', ]$phone | 
           email %in% contact[name == 'John' & phone == 123, ]$email), ]
#    name phone              email
# 1: John   123 john_abc@gmail.com
# 2: John   456 john_abc@gmail.com
# 3: John   456 john_xyz@gmail.com

# then, put the unique emails together in a string using
#     email_temp <- paste0(unique(temp$email), collapse = ";")
# and do the same for the phones using 
#     phone_temp <- paste0(unique(temp$phone), collapse = ";")

# and return there two strings to the columns "phone2" ans "email2"

#repeat for each unique key-combination (.EACHI)
Wimpel
  • 26,031
  • 1
  • 20
  • 37
4

A different approach using the powerjoin package:

contact <- tribble(
  ~name, ~phone, ~email,
  "John", 123, "john_abc@gmail.com",
  "John", 456, "john_abc@gmail.com",
  "John", 456, "john_xyz@gmail.com",
  "John", 789, "john_pqr@gmail.com") |> 
  mutate(row_id = row_number())


library(powerjoin)
library(dplyr)
# check duplicated entries in phone column
phone_check <- contact |>
  power_right_join(filter(contact, duplicated(phone)),
                   by = c("name", "phone"),
                   conflict = ~ paste(.x, .y, sep = ";")
  ) |>
  group_by(phone) |>
  slice(1) |>
  tidyr::separate_rows(row_id) |> 
  ungroup() |> 
  select(name, email, row_id)


# check duplicated entries in email column
email_check <- contact |>
  power_right_join(filter(contact, duplicated(email)),
                   by = c("name", "email"),
                   conflict = ~ paste(.x, .y, sep = ";") 
  ) |>
  group_by(email) |>
  slice(1) |>
  tidyr::separate_rows(row_id) |> 
  ungroup() |> 
  select(name, phone, row_id)



email_check |> select(name, phone, row_id) |> 
  inner_join(phone_check, by = c("name", "row_id")) |> 
  bind_rows(
    contact |> 
      mutate(phone = as.character(phone), 
             row_id = as.character(row_id)) |> 
      filter(!row_id %in% c(phone_check$row_id, email_check$row_id))
  ) |> 
  select(-row_id)


# A tibble: 2 × 3
  name  phone   email                                
  <chr> <chr>   <chr>                                
1 John  123;456 john_abc@gmail.com;john_xyz@gmail.com
2 John  789     john_pqr@gmail.com                   
Julian
  • 6,586
  • 2
  • 9
  • 33