Create a Group Key across two columns

Question

I'm trying to solve the below problem but I find it difficult to explain. I want to assign an incremental value based on the linkage between two columns (Colours & Letters).

Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
df

    Colours Letters
1   Green       X
2     Red       C
3   Green       Y
4   Green       A
5    Blue       C
6     Red       T
7   Brown       P

I'll assign a value to Group so that all identical Colours are in the same Group, along with any other Colour that shares the same Letter(s). For example, Group 2 includes Red & Blue given the shared linkage to Letter C.

Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
df
    Colours Letters Group
1   Green       X     1
2     Red       C     2
3   Green       Y     1
4   Green       A     1
5    Blue       C     2
6     Red       T     2
7   Brown       P     3

If an additional row was added with Colour = Green and Letter = C then the Group column would change to the below. All Greens would be grouped together with any other Colour (e.g. Red) that shared the same Letter (C in the case of Red). Furthermore, any Colour that shared a Letter with Red would likewise be added to the same Group as Green (such is the case for Blue, which shares the Letter C with Red).

  Colours Letters Group
1   Green       X     1
2     Red       C     1
3   Green       Y     1
4   Green       A     1
5    Blue       C     1
6     Red       T     1
7   Brown       P     2
8   Green       C     1

Can anyone help?

What value would be assigned to `Green` & `C` if they were in line 8? — markus, Nov 07 '18 at 11:38
Good question - I hadn't considered this. I've updated the question. — Wilze, Nov 07 '18 at 12:47
I think you can use graph theory and make your group ID by tagging connected components, somewhat like this: https://stackoverflow.com/q/36659114 — Frank, Nov 07 '18 at 13:05

gfgm · Answer 1 · 2018-11-07T13:18:42.040

As the @Frank above noted, you are describing a graph problem in that you want your group label to reflect connected components -- colours that share a letter. By converting your columns into a graph object you can figure out what the separate components are and return these as groups:

Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)

Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)

# load the igraph package for working with graphs
library(igraph)
adj.mat <- table(df$Colours, df$Letters) %*% t(table(df$Colours, df$Letters))

# visual inspection makes it clear what the components are
g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected', diag = F)
plot(g)

# we create a dataframe that matches each color to a component
mdf <- data.frame(Group_test = components(g)$membership,
                  Colours = names(components(g)$membership))

mdf
#>       Group_test Colours
#> Blue           1    Blue
#> Brown          2   Brown
#> Green          3   Green
#> Red            1     Red

# Then we just match them together
dplyr::left_join(df, mdf)
#> Joining, by = "Colours"
#>   Colours Letters Group Group_test
#> 1   Green       X     1          3
#> 2     Red       C     2          1
#> 3   Green       Y     1          3
#> 4   Green       A     1          3
#> 5    Blue       C     2          1
#> 6     Red       T     2          1
#> 7   Brown       P     3          2

Clearly the groups have a different numbering but split the colours similarly.

We can look at the extended case as a sanity check, where we add a linking color that reduces the set of components to 2:

# examining the extended case as a check
df2 <- data.frame(Colours = c(Colours, "Green"), Letters = c(Letters, "C"))
df2
#>   Colours Letters
#> 1   Green       X
#> 2     Red       C
#> 3   Green       Y
#> 4   Green       A
#> 5    Blue       C
#> 6     Red       T
#> 7   Brown       P
#> 8   Green       C

# lets wrap the procedure in a function for convenience
getGroup <- function(col, let, plot = FALSE){
  adj.mat <- table(col, let) %*% table(let, col)
  g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected',
                                   diag = F)
  if (plot) {plot(g)}
  comps <- components(g)$membership
  mdf <- data.frame(Group = comps, Colours = names(comps))
  mdf
}

# we get our desired group key (which we can merge back to the dataframe)
getGroup(df2$Colours, df2$Letters)
#>       Group Colours
#> Blue      1    Blue
#> Brown     2   Brown
#> Green     1   Green
#> Red       1     Red

^{Created on 2018-11-07 by the reprex package (v0.2.1)}

Thanks, gfgm. I read your answer about 4 times over the course of the day and I think I finally get it. Is this called network analysis? If so, I might need to watch a few tutorials to consolidate my understanding. Anyway, thanks for your help! — Wilze, Nov 08 '18 at 05:32
hi @Wilze, sorry I wasn't clearer -- looking at the response now I see I went a little fast. Yes this is a network analysis. We build a matrix of letters and colours: `table(Colours, Letters)`. And then we can obtain counts of how many Letters each Colour has in common with a matrix product: `A %*% t(A)`, where `t(A)` is the transpose of A. The resulting matrix will show in the ijth cell how many times a colour is linked to the same letter. This structure is a network adjacency matrix and that is what I build in the line `table(df$Colours, df$Letters) %*% t(table(df$Colours, df$Letters))` — gfgm, Nov 08 '18 at 08:20

Create a Group Key across two columns

1 Answers1