
I have a dataset of products with two columns representing classifications. I would like to obtain a group id based on the union of the two sets.

The group id has to be transitive: if class1 is the same for observations 1 and 2, and class2 is the same for observations 2 and 3, then 1, 2, and 3 all belong to the same group. In the example, you can see transitivity at work in the result, where rows 1-4 share the same group_id.

Any tips on how to do it would be appreciated =)

# Example
library(tibble)

df <- tribble(
  ~id, ~class1, ~class2,
  1, "A", "L1",
  2, "A", "L1",
  3, "B", "L1",
  4, "B", "L2",
  5, "C", "L3",
  6, "D", "L4")

# Desired output
result <- tribble(
  ~id, ~class1, ~class2, ~group_id,
  1, "A", "L1", 1,
  2, "A", "L1", 1,
  3, "B", "L1", 1, 
  4, "B", "L2", 1, 
  5, "C", "L3", 2,
  6, "D", "L4", 3)
benjasast

2 Answers

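library(dplyr)

# Start a new group whenever a row shares neither class1 nor class2
# with the row directly above it
df %>%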
  mutate(group_id = 1 + cumsum(!(class1 == lag(class1, default = class1[1]) | 
                                 class2 == lag(class2, default = class2[1]))))
# # A tibble: 6 x 4
#      id class1 class2 group_id
#   <dbl> <chr>  <chr>     <dbl>
# 1     1 A      L1            1
# 2     2 A      L1            1
# 3     3 B      L1            1
# 4     4 B      L2            1
# 5     5 C      L3            2
# 6     6 D      L4            3

(The 1 + is only there to match your output exactly; without it the ids start at 0, so the first four rows get 0, and so on. The grouping is the same whether the ids are 0-based or 1-based.)
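One caveat worth checking before relying on this: lag() only compares each row with the row immediately above it, so the approach assumes linked rows are adjacent. The sketch below uses a small shuffled data frame (df_shuffled is hypothetical, not from the question) to show what happens when two rows with identical classes are separated by an unrelated row.

```r
library(dplyr)

# Hypothetical shuffled data: ids 1 and 2 share both classes,
# but an unrelated row (id 5) sits between them.
df_shuffled <- data.frame(
  id     = c(1, 5, 2),
  class1 = c("A", "C", "A"),
  class2 = c("L1", "L3", "L1"),
  stringsAsFactors = FALSE
)

# Same rule as above: a new group starts whenever neither class
# matches the previous row.
out <- df_shuffled %>%
  mutate(group_id = 1 + cumsum(!(class1 == lag(class1, default = class1[1]) |
                                   class2 == lag(class2, default = class2[1]))))

out$group_id
# 1 2 3  (ids 1 and 2 end up in different groups)
```

If the data cannot be ordered so that linked rows always sit next to each other, a method that does not depend on row order, such as connected components on a graph, is probably the safer choice.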

r2evans

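We can use igraph: treat each class1/class2 pair as an edge and take the connected components of the resulting graph as the groups.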

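library(dplyr)
library(igraph)

# Build a graph with one edge per (class1, class2) pair
g <- df %>%
  select(class1, class2) %>%
  graph_from_data_frame(directed = FALSE)

# Named vector: vertex name -> connected component number
cls <- components(g)$membership

# Look up each row's component through its class1 vertex
df %>%
  mutate(group_id = unname(cls[class1]))
# A tibble: 6 x 4
#      id class1 class2 group_id
#   <dbl> <chr>  <chr>     <dbl>
# 1     1 A      L1            1
# 2     2 A      L1            1
# 3     3 B      L1            1
# 4     4 B      L2            1
# 5     5 C      L3            2
# 6     6 D      L4            3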
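If adding a package dependency is not an option, the same connected-components idea can be sketched in base R with a small union-find. This is only a minimal illustration: group_by_union is a hypothetical helper, not a function from igraph or dplyr, and it assumes class1 and class2 are separate label spaces, so it prefixes the values to keep a label that appears in both columns from being merged.

```r
# Hypothetical helper (not from any package): assign a group id to each row
# so that rows linked through a shared class1 or class2 value, directly or
# transitively, get the same id.
group_by_union <- function(class1, class2) {
  # Assumption: the two columns are distinct label spaces, so prefix them.
  n1 <- paste0("c1:", class1)
  n2 <- paste0("c2:", class2)
  nodes <- unique(c(n1, n2))

  # parent[i] is the parent of node i; a root is its own parent.
  parent <- seq_along(nodes)

  # Follow parent links until the root of the node's set is reached.
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  a <- match(n1, nodes)
  b <- match(n2, nodes)

  # Each row is an edge between its class1 node and its class2 node:
  # merge the two sets if they are not already the same.
  for (k in seq_along(a)) {
    ra <- find(a[k])
    rb <- find(b[k])
    if (ra != rb) parent[rb] <- ra
  }

  # Number the groups 1, 2, ... in order of first appearance.
  roots <- vapply(a, find, integer(1))
  match(roots, unique(roots))
}

example <- data.frame(
  id     = 1:6,
  class1 = c("A", "A", "B", "B", "C", "D"),
  class2 = c("L1", "L1", "L1", "L2", "L3", "L4"),
  stringsAsFactors = FALSE
)
example$group_id <- group_by_union(example$class1, example$class2)
example$group_id
# 1 1 1 1 2 3
```

The result does not depend on row order, because every row contributes an edge regardless of where it sits in the data. The sketch skips path compression and union by rank, so for large inputs the igraph version is likely faster.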
akrun