2

I have a dataframe that I want to group based on the values in a column.

The trick is though there are some rows that need be merged if the values have already appeared in a column that I am grouping by.

For example:

df <- data.frame(col1 = c("R1", "R2", "R2", "R2", "R2", "R4", "R5", "R5", "R5"),
                 col2 = c("R10", "R4", "R5", "R6", "R7", "R5", "R6", "R7", "R9"), stringsAsFactors = FALSE)

df2 <- aggregate(col2 ~ col1, df, FUN = function(x) paste(unique(x), collapse = ", "))

> df
  col1 col2
1   R1  R10
2   R2   R4
3   R2   R5
4   R2   R6
5   R2   R7
6   R4   R5
7   R5   R6
8   R5   R7
9   R5   R9

> df2
  col1           col2
1   R1            R10
2   R2 R4, R5, R6, R7
3   R4             R5
4   R5     R6, R7, R9

R10 will be in group R1 (row 1)

R4, R5, R6 and R7 will be in group R2 (rows 2 to 5) R5 will be in group R4 (row 6)

R6, R7 and R9 will be in group R5 (rows: 7 to 9)

But R4 and R5 are already in R2 so this will stay in R2. For R9 which is originally assigned to R5, needs to be grouped in to R2.

So the desired outcome will be:

> df3
  col1               col2
1   R1                R10
2   R2 R4, R5, R6, R7, R9

OR preferably:

1 col1 col2
2   R1  R10
3   R2   R4
4   R2   R5
5   R2   R6
6   R2   R7
7   R2   R9
MKa
  • 2,248
  • 16
  • 22
  • `For R9 which is originally assigned to R5, needs to be grouped in to R2.` why? How? – Ronak Shah Jul 23 '19 at 02:49
  • Because R5 is included in R2. If you look at `df2` `col2` R5 is mapped to R2. – MKa Jul 23 '19 at 02:56
  • 2
    Check out the *igraph* package - this is a clustering problem essentially if you can be flexible with your output. E.g.: `library(igraph); g <- graph.data.frame(df); clusters(g)` – thelatemail Jul 23 '19 at 03:09
  • 1
    related: https://stackoverflow.com/questions/56548395/find-unique-set-of-identifiers-groups-among-several-columns https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns https://stackoverflow.com/questions/56740990/fast-way-to-group-variables-based-on-direct-and-indirect-similarities-in-multipl – chinsoon12 Jul 23 '19 at 03:28
  • Wow, `igraph` package is great. Thanks! – MKa Jul 23 '19 at 05:18

2 Answers2

3

An option would be to replace the values based on the intersecting elements and then do the aggregate

i1 <- df$col1 %in% df$col2
df$col1[i1] <- df$col1[match(df$col1[inds], df$col2)]
aggregate(col2 ~ col1, unique(df), FUN = toString)
#   col1               col2
#1   R1                R10
#2   R2 R4, R5, R6, R7, R9

Or with tidyverse

library(dplyr)
library(stringr)
df %>% 
    group_by(col1 = case_when(col1 %in%  intersect(col1, col2) ~ "R2", 
                   TRUE ~ col1)) %>% 
    distinct %>% 
    summarise(col2 = toString(col2))
# A tibble: 2 x 2
#  col1  col2              
#  <chr> <chr>             
#1 R1    R10               
#2 R2    R4, R5, R6, R7, R9
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks akrun. In the actual dataset, I have more rows and more groups to be defined so I prefer not to hardcode `R2` if possible. Would there be a way to replace this line? – MKa Jul 23 '19 at 03:03
  • Thanks, this works on this example - in the actual dataset I've got, this still generates few overlapping groups but I think I can work from here. – MKa Jul 23 '19 at 05:17
1

One option with base R can be

inds <- df$col1 %in% df$col2
df$col1[inds] <- df$col1[match(df$col1[inds], df$col2)]

and then we can take only unique values of dataframe

unique(df)

#  col1 col2
#1   R1  R10
#2   R2   R4
#3   R2   R5
#4   R2   R6
#5   R2   R7
#9   R2   R9

Or if you want comma-separated string

aggregate(col2 ~ col1, unique(df), toString)

#  col1               col2
#1   R1                R10
#2   R2 R4, R5, R6, R7, R9
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213