1

I am trying to group similar entities together and can't find an easy way to do so.

For example, here is a table:

                  Names Initial_Group Final_Group
1          James,Gordon             6           A
2          James,Gordon             6           A
3          James,Gordon             6           A
4          James,Gordon             6           A
5          James,Gordon             6           A
6          James,Gordon             6           A
7                Amanda             1           A
8                Amanda             1           A
9                Amanda             1           A
10        Gordon,Amanda             5           A
11        Gordon,Amanda             5           A
12        Gordon,Amanda             5           A
13        Gordon,Amanda             5           A
14        Gordon,Amanda             5           A
15        Gordon,Amanda             5           A
16        Gordon,Amanda             5           A
17        Gordon,Amanda             5           A
18 Edward,Gordon,Amanda             4           A
19 Edward,Gordon,Amanda             4           A
20 Edward,Gordon,Amanda             4           A
21                 Anna             2           B
22                 Anna             2           B
23                 Anna             2           B
24         Anna,Leonard             3           B
25         Anna,Leonard             3           B
26         Anna,Leonard             3           B

I am unsure how to generate the 'Final_Group' field in the table above.

To get it, any entries that share at least one name, directly or through a chain of other entries, must be placed in the same group:

  • For example, rows 1 to 20 need to be grouped together because they are all connected through at least one shared name.

  • Rows 1 to 6 contain 'James,Gordon', and since 'Gordon' also appears in rows 10:20, they all have to be grouped. Likewise, since 'Amanda' appears in rows 7:9, these have to be grouped with 'James,Gordon', 'Gordon,Amanda', and 'Edward,Gordon,Amanda'.

Below is code to generate the initial data:

library(dplyr)

# Manually generating data
Names <- c(rep('James,Gordon',6)
          ,rep('Amanda',3)
          ,rep('Gordon,Amanda',8)
          ,rep('Edward,Gordon,Amanda',3)
          ,rep('Anna',3)
          ,rep('Anna,Leonard',3))
Initial_Group <- rep(1:6,c(6,3,8,3,3,3))
Final_Group <- rep(c('A','B'),c(20,6))
data <- data.frame(Names,Initial_Group,Final_Group)

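# Grouping: one index per distinct Names value
# (cur_group_id() replaces the deprecated group_indices(., Names) form)
data %>%
  select(Names) %>%
  group_by(Names) %>%
  mutate(Initial_Group = cur_group_id()) %>%
  ungroup()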

Does anyone know of any way to do this in R?

KedAU
  • 1
    Sounds like a graph problem. You will need to split each comma separated name to a new row so you have a `Name--Initial_Group` relationship running down the page (e.g. - https://tidyr.tidyverse.org/reference/separate_rows.html ), then you can find the clusters - https://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together – thelatemail Aug 27 '21 at 04:19
  • 1
    Oh cool, thank you for this. I tried it with the data, and it works! – KedAU Aug 27 '21 at 04:55
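
The approach suggested in the comment above can be sketched roughly as follows. This is only an illustration under some assumptions: the column name `Name` and the objects `edges`, `g` and `membership` are made up for this sketch, and it relies on `tidyr::separate_rows()`, `igraph::graph_from_data_frame()` and `igraph::components()`. Each distinct `Names` string is linked to every individual name it contains, and rows whose strings end up in the same connected component receive the same group.

```r
library(dplyr)
library(tidyr)
library(igraph)

# Same Names column as in the question
data <- data.frame(
  Names = c(rep("James,Gordon", 6), rep("Amanda", 3), rep("Gordon,Amanda", 8),
            rep("Edward,Gordon,Amanda", 3), rep("Anna", 3), rep("Anna,Leonard", 3)),
  stringsAsFactors = FALSE
)

# One row per (Names string, individual name) pair; these are the graph edges.
# `Name` is a hypothetical helper column introduced for this sketch.
edges <- data %>%
  distinct(Names) %>%
  mutate(Name = Names) %>%
  separate_rows(Name, sep = ",")

# Undirected graph whose vertices are the Names strings and the individual names
g <- graph_from_data_frame(edges, directed = FALSE)

# Named vector: vertex name -> connected-component id
membership <- components(g)$membership

# Look up each row's component by its Names string and turn the id into a letter
data$Final_Group <- LETTERS[membership[as.character(data$Names)]]
data
```

On the example data this should place rows 1 to 20 in one group and rows 21 to 26 in another. Looking the component up by the full `Names` string, rather than by pattern matching, avoids accidental matches when one name is a substring of another.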

2 Answers

0

I initially misunderstood the question; I now take it that your focus is Final_Group. If not, please let me know. My approach is based on the distance between samples.

library(dplyr)
library(stringr)

# Split each comma-separated string into a character vector (list column)
data <- data %>%
  mutate(Names = str_split(Names, ","))

# One logical indicator column per person
for (i in seq_len(nrow(data))) {
  data$James[i]   <- "James"   %in% data$Names[[i]]
  data$Gordon[i]  <- "Gordon"  %in% data$Names[[i]]
  data$Amanda[i]  <- "Amanda"  %in% data$Names[[i]]
  data$Edward[i]  <- "Edward"  %in% data$Names[[i]]
  data$Anna[i]    <- "Anna"    %in% data$Names[[i]]
  data$Leonard[i] <- "Leonard" %in% data$Names[[i]]
}

# Complete-linkage hierarchical clustering on the indicator columns
hc <- data %>%
  select(-Names, -Final_Group, -Initial_Group) %>%
  dist() %>%
  hclust(method = "complete")

# cutree() needs k (number of clusters) or h (cut height); two clusters here
cutree(hc, k = 2)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
plot(hc)

[Figure: dendrogram produced by plot(hc)]

Now that is similar to Final_Group.

Park
  • Hey Park, Thanks for the reply. The only thing is that I don't have the "Final_Group" column. That's the one I need to generate. – KedAU Aug 27 '21 at 04:53
  • @KedAU In the last vector, returned by cutree(hc), changing 1 -> A and 2 -> B gives Final_Group. I think the code needs more editing, my apologies. – Park Aug 27 '21 at 04:59
  • @KedAU Sample 1~20 and 21~26 are separated into different clusters. – Park Aug 27 '21 at 05:02
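
The relabelling described in the comments above (1 -> A, 2 -> B) can be done by indexing the built-in `LETTERS` vector. A minimal sketch, in which `clusters` is a hypothetical stand-in for the integer vector that `cutree()` returns in the answer:

```r
# Stand-in for the cluster ids from cutree(hc, k = 2): 20 ones, then 6 twos
clusters <- rep(c(1L, 2L), c(20, 6))

# LETTERS is base R's vector "A".."Z"; indexing by cluster id maps 1 -> "A", 2 -> "B"
Final_Group <- LETTERS[clusters]

# Count how many rows fall in each group
table(Final_Group)
```

This only works for up to 26 clusters, since `LETTERS` has 26 elements.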
0

This is a long one but you could do:

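library(tidyverse)
library(igraph)

df <- data  # the data frame from the question

df %>%
  select(Names) %>%
  distinct() %>%
  # first name goes to `first`, any remaining names stay together in `second`
  separate(Names, c('first', 'second'), extra = 'merge', fill = 'right') %>%
  separate_rows(second) %>%
  # single names have no partner: give each one a unique placeholder vertex
  mutate(second = coalesce(second, as.character(cumsum(is.na(second))))) %>%
  graph_from_data_frame() %>%
  components() %>%
  getElement('membership') %>%
  # per vertex name: its component id where the name occurs in Names, else 0
  imap(~str_detect(df$Names, .y) * .x) %>%
  # element-wise maximum across names gives each row's component id
  # (reduce() is used here in place of the deprecated invoke())
  reduce(pmax) %>%
  cbind(df, value = LETTERS[.], value1 = .)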

                  Names Initial_Group Final_Group value value1
1          James,Gordon             6           A     A      1
2          James,Gordon             6           A     A      1
3          James,Gordon             6           A     A      1
4          James,Gordon             6           A     A      1
5          James,Gordon             6           A     A      1
6          James,Gordon             6           A     A      1
7                Amanda             1           A     A      1
8                Amanda             1           A     A      1
9                Amanda             1           A     A      1
10        Gordon,Amanda             5           A     A      1
11        Gordon,Amanda             5           A     A      1
12        Gordon,Amanda             5           A     A      1
13        Gordon,Amanda             5           A     A      1
14        Gordon,Amanda             5           A     A      1
15        Gordon,Amanda             5           A     A      1
16        Gordon,Amanda             5           A     A      1
17        Gordon,Amanda             5           A     A      1
18 Edward,Gordon,Amanda             4           A     A      1
19 Edward,Gordon,Amanda             4           A     A      1
20 Edward,Gordon,Amanda             4           A     A      1
21                 Anna             2           B     B      2
22                 Anna             2           B     B      2
23                 Anna             2           B     B      2
24         Anna,Leonard             3           B     B      2
25         Anna,Leonard             3           B     B      2
26         Anna,Leonard             3           B     B      2

Check the column called value; it matches Final_Group.

Onyambu