
I want an efficient method of determining the entire hierarchy-type relationship for a table of number pairs, and then expressing that relationship as a vector or string, so that I can derive other useful information about each pair's hierarchy, such as the highest related integer, the lowest related integer, and the total number of related integers.

For example, I have a table of integer pairs:

  X    Y       
 ---   ---
  5    10
  5    11
  11   12
  11   13
  13   3
  20   18
  17   18
  50   18
  20   21

A record is related to another record if either value in the pair appears in any other pair. The final table would look something like this:

  X    Y    Related IDs
  ---  ---  ---------------   
  5    10    3,5,10,11,12,13 
  5    11    3,5,10,11,12,13 
  11   12    3,5,10,11,12,13 
  11   13    3,5,10,11,12,13 
  13   3     3,5,10,11,12,13 
  20   18    17,18,20,21,50
  17   18    17,18,20,21,50
  50   18    17,18,20,21,50
  20   21    17,18,20,21,50

What I have now is admittedly a mess. It uses a fuzzy_join with a matching function that takes x, y as a vector and matches between them. That match then creates a larger vector of all four matching numbers, which goes back into the fuzzy_join to match again. This loops until there are no more matches. It degrades very quickly, and at about 4k records it simply stops responding. The entire initial table of pairs will stay under 100k records.

amz2
  • Have you forgotten to add the code to be checked? – jay.sf Aug 07 '20 at 16:05
  • You may treat it as a graph problem. `library(igraph)`; `g = graph_from_data_frame(d)`; `components(g)$membership`; `plot(g)` – Henrik Aug 07 '20 at 16:15
  • I figured what I had was so far off base it probably wasn't useful. – amz2 Aug 07 '20 at 16:18
  • Related: [Make a group_indices based on several columns](https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns/); [Fast way to group variables based on direct and indirect similarities in multiple columns](https://stackoverflow.com/questions/56740990/fast-way-to-group-variables-based-on-direct-and-indirect-similarities-in-multipl) – Henrik Aug 07 '20 at 17:12

5 Answers


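Assuming the question's pairs are in a data frame `df` (a reconstruction, since the post doesn't include a constructor):

df <- data.frame(X = c(5, 5, 11, 11, 13, 20, 17, 50, 20),
                 Y = c(10, 11, 12, 13, 3, 18, 18, 18, 21))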
In base R you could do:

relation <- function(dat){
  .relation <- function(x){
    # gather x plus every value one step away from x via either column
    k <- unique(sort(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1])))
    # stop once the set no longer grows, otherwise expand again
    if(setequal(x, k)) toString(k) else .relation(k)}
  sapply(dat[, 1], .relation)
}

df$related <- relation(df)

df
   X  Y              related
1  5 10 3, 5, 10, 11, 12, 13
2  5 11 3, 5, 10, 11, 12, 13
3 11 12 3, 5, 10, 11, 12, 13
4 11 13 3, 5, 10, 11, 12, 13
5 13  3 3, 5, 10, 11, 12, 13
6 20 18   17, 18, 20, 21, 50
7 17 18   17, 18, 20, 21, 50
8 50 18   17, 18, 20, 21, 50
9 20 21   17, 18, 20, 21, 50

If you have igraph installed, you could do:

library(igraph)
# connected-component membership for every distinct value (vertex)
a <- components(graph_from_data_frame(df, FALSE))$membership
# collapse each component's vertex names into one string
b <- tapply(names(a), a, toString)
df$related <- b[a[as.character(df$X)]]
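If the per-group summaries mentioned in the question (highest related integer, lowest related integer, total count) are what's ultimately needed, the membership vector `a` above can be summarised directly; a minimal sketch:

# one row per component: smallest value, largest value, and how many values it contains
vals <- as.integer(names(a))
data.frame(lowest  = tapply(vals, a, min),
           highest = tapply(vals, a, max),
           n       = tapply(vals, a, length))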

EDIT:

If we are comparing the speed of the functions, note that in my function above the last statement, i.e. sapply(dat[,1], ...), recomputes the group for every element even when it has already been computed. E.g. sapply(c(5,5,5,5), ...) will compute the same group four times instead of once. Now use:

relation <- function(dat){
  .relation <- function(x){
    k <- unique(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1]))
    if(setequal(x, k)) sort(k) else .relation(k)}
  d <- unique(dat[, 1])
  m <- setNames(character(length(d)), d)
  # compute each group only once, then drop all of its members from the queue
  while(length(d) > 0){
    s <- .relation(d[1])
    m[as.character(s)] <- toString(s)
    d <- d[!d %in% s]
  }
  dat$groups <- m[as.character(dat[, 1])]
  dat
}
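Note that this revised version returns the whole data frame with a `groups` column rather than a character vector, so it is called slightly differently than the first version:

relation(df)   # returns df with an added 'groups' column (instead of df$related <- relation(df))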

Now do the benchmark:

 df1 <- do.call(rbind, rep(list(df), 100))
 microbenchmark::microbenchmark(relation(df1), group_pairs(df1))
Unit: milliseconds
             expr      min        lq       mean    median       uq      max neval
    relation(df1)   1.0909   1.17175   1.499096   1.27145   1.6580   3.2062   100
 group_pairs(df1) 153.3965 173.54265 199.559206 190.62030 213.7964 424.8309   100
Onyambu

Another option with igraph

library(igraph)
# cluster (connected-component) membership for every distinct value
clt <- clusters(graph_from_data_frame(df, directed = FALSE))$membership
# collapse each cluster's members and attach the string to the matching rows
within(df, ID <- ave(names(clt), clt, FUN = toString)[match(as.character(X), names(clt))])

such that

   X  Y                   ID
1  5 10 5, 11, 13, 10, 12, 3
2  5 11 5, 11, 13, 10, 12, 3
3 11 12 5, 11, 13, 10, 12, 3
4 11 13 5, 11, 13, 10, 12, 3
5 13  3 5, 11, 13, 10, 12, 3
6 20 18   20, 17, 50, 18, 21
7 17 18   20, 17, 50, 18, 21
8 50 18   20, 17, 50, 18, 21
9 20 21   20, 17, 50, 18, 21
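If the IDs should appear in ascending order, as in the question's expected output, one possible tweak is to sort each cluster's values numerically before collapsing (a sketch based on the code above):

df$ID <- ave(names(clt), clt,
             FUN = function(v) toString(sort(as.numeric(v))))[match(as.character(df$X), names(clt))]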
ThomasIsCoding

This is nowhere near as elegant as Onyambu's base R answer, but I benchmark it as about 4 or 5 times faster. It works by assigning each row to a group, adding its contents to the set of all numbers in that group, then finding the next unassigned row that has at least one member in the set. Once there are no more matching rows, it jumps to the next unassigned row.

group_pairs <- function(df)
{
  df$ID <- numeric(nrow(df))   # 0 means "not yet assigned to a group"
  ID <- 1
  row <- 1
  current_set <- numeric()
  
  while(any(df$ID == 0))
  {
    # assign the current row to the current group and grow the set of its members
    df$ID[row]  <- ID
    current_set <- unique(c(current_set, df$x[row], df$y[row]))
    nextrows    <- c(which(df$x %in% current_set & df$ID == 0), 
                     which(df$y %in% current_set & df$ID == 0))
    if (length(nextrows) > 0)
    {
      row <- unique(nextrows)[1]
    }
    else
    {
      # no unassigned row touches the current set: start a new group
      ID <- ID + 1
      row <- which(df$ID == 0)[1]
      current_set <- numeric()
    }
  }
  
  # replace each numeric group ID with the sorted, comma-separated members of that group
  df$ID <- sapply(split(df[-3], df$ID), 
                  function(i) paste(sort(unique(unlist(i))), collapse = ", "))[df$ID]
  df
}

So you can do:

group_pairs(df)
#>    x  y                   ID
#> 1  5 10 3, 5, 10, 11, 12, 13
#> 2  5 11 3, 5, 10, 11, 12, 13
#> 3 11 12 3, 5, 10, 11, 12, 13
#> 4 11 13 3, 5, 10, 11, 12, 13
#> 5 13  3 3, 5, 10, 11, 12, 13
#> 6 20 18   17, 18, 20, 21, 50
#> 7 17 18   17, 18, 20, 21, 50
#> 8 50 18   17, 18, 20, 21, 50
#> 9 20 21   17, 18, 20, 21, 50

and

microbenchmark::microbenchmark(relation(df), group_pairs(df))
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq      max neval cld
#>     relation(df) 4.535100 5.027551 5.737164 5.829652 6.256301 7.669001   100   b
#>  group_pairs(df) 1.022502 1.159601 1.398604 1.338501 1.458950 8.903800   100  a 
Allan Cameron
  • I guess the `igraph` solution would be the fastest if you put it in benchmark – ThomasIsCoding Aug 07 '20 at 16:46
  • @ThomasIsCoding you'd hope so. I haven't tried. Writing this algorithm in C++ would be pretty straightforward and might be the best option for large data sets. – Allan Cameron Aug 07 '20 at 16:52
  • Here is a reference for benchmarking in a similar problem https://stackoverflow.com/questions/58838981/grouping-factors-or-integers-into-equivalence-classes-in-r. Anyway, upvoted for your solution :) – ThomasIsCoding Aug 07 '20 at 16:55
  • @ThomasIsCoding I can't get Onyambu's igraph solution to work – Allan Cameron Aug 07 '20 at 16:55
  • Sorry but check my code. it is not efficient. I am repeating to compute the same thing more than once. Let me make it efficient then you could do the comparison – Onyambu Aug 07 '20 at 16:56
  • Well, I ran Onyambu's code without problem. What error do you encounter, then? – ThomasIsCoding Aug 07 '20 at 17:11
  • @ThomasIsCoding it tells me there are 11 elements to replace but the data only has 9. I must have made a mistake. I'll look again. – Allan Cameron Aug 07 '20 at 17:20
  • @ThomasIsCoding yes - I'd made an error with column names. According to my benchmark, the igraph function is a little slower than mine - not sure how it would scale to larger sets though. – Allan Cameron Aug 07 '20 at 17:29
  • That's great! I made another `igraph` solution, don't know if it also makes sense – ThomasIsCoding Aug 07 '20 at 17:30
  • @AllanCameron check the edit. You will note that the code you have is influenced by the data frame's rows rather than by uniqueness. Also it's way too slow. – Onyambu Aug 07 '20 at 17:33

I think you may also do something like this in the tidyverse only (I am using an elaborated data frame with a few added rows). This strategy keeps accumulating related_ids. Here id is just a row id and has no special purpose; you may drop that step too.

df <- data.frame(X = c(5,5,11,11,13,20, 17,50, 20, 5, 1, 17),
                 Y = c(10, 11, 12, 13, 3, 18, 18, 18, 21, 13, 2, 50))

library(tidyverse)

df %>% arrange(pmax(X, Y)) %>% 
  mutate(id = row_number()) %>% rowwise() %>%
  mutate(related_ids = list(c(X, Y))) %>% ungroup() %>%
  # merge the current pair into the running set whenever they share a value
  mutate(related_ids = accumulate(related_ids, ~if(any(.y %in% .x)){union(.x, .y)} else {.y})) %>%
  as.data.frame()
#>     X  Y id          related_ids
#> 1   1  2  1                 1, 2
#> 2   5 10  2                5, 10
#> 3   5 11  3            5, 10, 11
#> 4  11 12  4        5, 10, 11, 12
#> 5  11 13  5    5, 10, 11, 12, 13
#> 6  13  3  6 5, 10, 11, 12, 13, 3
#> 7   5 13  7 5, 10, 11, 12, 13, 3
#> 8  17 18  8               17, 18
#> 9  20 18  9           17, 18, 20
#> 10 20 21 10       17, 18, 20, 21
#> 11 50 18 11   17, 18, 20, 21, 50
#> 12 17 50 12   17, 18, 20, 21, 50

Created on 2021-06-01 by the reprex package (v2.0.0)

AnilGoyal

Updated: I think you can also use the following solution. It's rather verbose, but I think it's quite effective:

library(dplyr)
library(tidyr)

# First I created an id column to be able to group the observations with any duplicated 
# values
df %>%
  arrange(X, Y) %>%
  mutate(dup = ifelse((X == lag(X, default = 0) | X == lag(Y, default = 0)) |
                        (Y == lag(X, default = 0) | Y == lag(Y, default = 0)) |
                        (X == lag(X, n = 2L, default = 0) | Y == lag(Y, n = 2L, default = 0)) |
                        (X == lag(Y, n = 2L, default = 0) | Y == lag(X, n = 2L, default = 0)) |
                        (X == lag(Y, n = 3L, default = 0) | Y == lag(X, n = 3L, default = 0)) |
                        (X == lag(X, n = 3L, default = 0) | Y == lag(Y, n = 3L, default = 0)), 1, 0)) %>%
  mutate(id = cumsum(dup == 0)) %>%
  select(-dup) -> df1


df1 %>%
  group_by(id) %>%
  pivot_longer(c(X, Y), names_to = "Name", values_to = "Val") %>%
  arrange(Val) %>%
  mutate(dup = Val == lag(Val, default = 10000)) %>%
  filter(!dup) %>%
  mutate(across(Val, ~ paste(.x, collapse = "-"))) %>%
  select(-dup) %>%
  slice(2:n()) %>%
  select(-Name) %>%
  right_join(df1, by = "id") %>%
  group_by(Val, X, Y) %>%
  distinct() %>%
  select(-id) %>%
  relocate(X, Y)

# A tibble: 9 x 3
# Groups:   Val, X, Y [9]
      X     Y Val            
  <int> <int> <chr>          
1     5    10 3-5-10-11-12-13
2     5    11 3-5-10-11-12-13
3    11    12 3-5-10-11-12-13
4    11    13 3-5-10-11-12-13
5    13     3 3-5-10-11-12-13
6    17    18 17-18-20-21-50 
7    20    18 17-18-20-21-50 
8    20    21 17-18-20-21-50 
9    50    18 17-18-20-21-50 

I also tried it on @AnilGoyal's elaborated data frame:

# A tibble: 12 x 3
# Groups:   Val, X, Y [12]
       X     Y Val            
   <dbl> <dbl> <chr>          
 1     1     2 1-2            
 2     5    10 3-5-10-11-12-13
 3     5    11 3-5-10-11-12-13
 4     5    13 3-5-10-11-12-13
 5    11    12 3-5-10-11-12-13
 6    11    13 3-5-10-11-12-13
 7    13     3 3-5-10-11-12-13
 8    17    18 17-18-20-21-50 
 9    17    50 17-18-20-21-50 
10    20    18 17-18-20-21-50 
11    20    21 17-18-20-21-50 
12    50    18 17-18-20-21-50 
Anoushiravan R