Selecting the maximum number of values from a column that share values in another column in R

Question

I have a large dataset with sites that were sampled irregularly over 40 years. I want to select the maximum number of sites that share, let’s say, at least 5 years of data.

Any pointers would be appreciated.

Here’s an example dataset:

library(tidyverse)

set.seed(123)

(DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sample(1982:2021, NYears))) %>%
  unnest(Years) %>%
  select(-NYears))

ThomasIsCoding · Answer 1 · 2023-07-26T23:00:40.470

Disclaimer

The approach below is NOT as efficient as the solution by @jblood94 (so, DON'T use my solution for large dataset if you are pursuing the speed), but just to change the mindset by thinking in the graph-theory way and explore the possibility of using igraph to solve the problem.

Brief Idea (Graph Theory Perspective)

Generally speaking, I think this question can be processed in the graph theory manner and solved by igraph. If you are pursuing the efficiency, you probably need to explore the potential properties hidden behind the graph. For example:

The number of shared Years can be interpreted as the edge weight that is associated with two Sites vertices.
Also, the graph can be further simplified since some edges with weight <=4 can be skipped when searching the cliques. Pruning network and searching afterwards should be more efficient than iterating over all possible combinations.

If you are interested in the details, please refer to the the follow-up answer with code breakdowns.

One `igraph` Approach

Below might be one igraph option to solve the problem (see comments to the code for details): You can try graph_from_adjacency_matrix on Sites and find the cliques using cliques(), e.g.,

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
    graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
    ) %>%
    # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
    subgraph.edges(E(.)[E(.)$weight > 4]) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))

or an alternative

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    `>=`(5) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))

which gives

> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67

[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84

If you want to further show the shared years, you can take additional actions on top of res, e.g.,

lapply(
    res,
    \(q) {
        list(
            sites = as.integer(names(q)),
            sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
        )
    }
)

which gives

[[1]]
[[1]]$sites
[1] 31 59 67

[[1]]$shared_years
[1] 1989 1992 1999 2002 2005


[[2]]
[[2]]$sites
[1] 26 53 84

[[2]]$shared_years
[1] 1989 1991 1998 2001 2011

Discussion

In the igraph options above, cliques() would be the performance bottleneck especially when the condition is "the number of shared Years should be >=k" for small ks, e.g., k=1 or k=2. In those cases, there are significantly more cliques to be enumerated by cliques() before Filter(). You can refer to the benchmarking results by @jblood94.

+1, ah, yes, filtering the cliques--seems like it should have been obvious. I'm running the benchmarks against the matrix solution. — jblood94, Jul 24 '23 at 12:12
Thank you so much! Can you elaborate on the reasoning? I can't follow it completely, but I probably don't know much about graph theory. Sites would be nodes and edges the years? — mferreira, Jul 24 '23 at 13:46
@mferreira Yes, `Sites` are nodes, and "the number of shared years between nodes" is the edge **weight**. — ThomasIsCoding, Jul 24 '23 at 20:29
@mferreira Please see my another answer, where the steps are elaborated. — ThomasIsCoding, Jul 25 '23 at 09:39

jblood94 · Accepted Answer · 2023-07-24T14:44:12.890

Here is an approach starts with pairs of sites that meet the criterion, and then iteratively adds sites to each group. It is pretty efficient, solving the 5 shared years problem on the OP's dataset almost instantly.

There's almost certainly room for improving the efficiency of this approach, but this should give you a good place to start.

Implemented as a function:

library(Matrix)
library(data.table)
library(Rfast) # for rowSort()

f <- function(df, shared) {
  if (shared == 1) {
    # special case
    # get the yeear(s) with the maximum number of sites
    dt <- setorder(setDT(df), Sites, Years)[
      Years %in% df[,.(.N), Years][N == max(N)][[1]],
      setDT(setNames(as.list(Sites), paste0("Site", 1:.N))), Years
    ]
    # combine years with the same set of sites
    setcolorder(
      dt[
        ,.(nShared = .N, Years = .(Years)), eval(names(dt)[-1])
      ], "nShared"
    )
  } else if (shared == 2) {
    # special case
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along rows
      match(df$Years, u1 <- sort(unique(df$Years))), # years along columns
      x = 1L
    )
    # shared sites between year pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    # return an empty data.table if no pairs meet the criterion
    if (!length(m2@x)) return(
      data.table(nShared = integer(0), Site1 = integer(0), Years = list())
    )
    # find the year pairs that share the maximum number of sites
    i <- which(m2@x == (mx <- max(m2@x)))
    setcolorder(
      # initialize the output
      as.data.table(
        matrix(
          u2[(m[,m2@i[i] + 1L, drop = FALSE]*m[,m2@j[i] + 1L, drop = FALSE])@i + 1L],
          length(i), mx, 1, list(NULL, paste0("Site", 1:mx))
        )
      )[
        , Years := as.list(
          as.data.frame(
            matrix(u1[c(m2@i[i] + 1L, m2@j[i] + 1L)], 2, byrow = TRUE)
          )
        )
        # combine year pairs with the same set of sites
      ][,.(Years = .(unique(unlist(Years)))), eval(paste0("Site", 1:mx))][
        ,nShared := lengths(Years)
      ], "nShared"
    )
  } else {
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Years, u1 <- sort(unique(df$Years))), # years along rows
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along columns
      x = 1L
    )
    # shared years between site pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    i <- m2@x >= shared # index of pairings with at least "shared" years shared
    # initialize the output
    dt <- data.table(nShared = m2@x[i], Site1 = m2@i[i] + 1L, Site2 = m2@j[i] + 1L)
    # return an empty data.table if no pairs meet the criterion
    if (!nrow(dt)) return(dt[,Years := list()])
    n <- 2L # current number of sites all sharing "shared" number of years
    
    while (1) {
      # find additional sites to that can be added to each group
      m2 <- as(crossprod(Reduce("*", lapply(2:(n + 1L), \(i) m[,dt[[i]], drop = FALSE])), m), "TsparseMatrix")
      # don't add the same site twice!
      m2[cbind(rep(1:nrow(dt), n), unlist(dt[,2:(n + 1L)]))] <- 0
      i <- m2@x >= shared
      
      if (any(i)) { # an additional site can be added to at least one group
        n <- n + 1L
        # update the outupt
        dt <- unique(dt[m2@i[i] + 1L][,paste0("Site", n) := m2@j[i] + 1L][
          ,nShared := m2@x[i]
        ][
          ,paste0("Site", 1:n) := as.data.frame(rowSort(as.matrix(.SD))),
          .SDcols = 2:(n + 1L)
        ])
      } else break
    }
    
    # convert site indices back to the original values and record the common
    # years for each group
    dt[
      ,Years := .(.(u1[as.logical(Reduce("*", lapply(.SD, \(i) m[,i])))])),
      1:nrow(dt), .SDcols = 2:(n + 1L)
    ][
      ,paste0("Site", 1:n) := as.data.frame(matrix(u2[unlist(.SD)], .N)),
      .SDcols = 2:(n + 1L)
    ]
  }
}

Testing:

f(DF, 1)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13 Site14 Site15 Site16 Site17 Site18 Site19 Site20 Site21 Site22 Site23 Site24 Site25 Site26 Site27 Site28 Site29 Site30 Site31 Site32 Years
#> 1:       1     1     2     4     5    11    12    13    14    22     38     41     42     46     47     48     54     58     59     62     65     66     69     71     72     75     80     87     88     90     92     93    100  2004
f(DF, 2)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13     Years
#> 1:       2     3    11    26    31    37    38    49    53    55     64     65     68     84 1989,1991
f(DF, 3)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6          Years
#> 1:       3    24    48    61    87    95    96 1984,2017,2018
#> 2:       3    26    31    37    53    65    84 1989,1991,2001
#> 3:       3    26    31    53    55    64    84 1989,1991,1998
f(DF, 4)[]
#>    nShared Site1 Site2 Site3 Site4               Years
#> 1:       4    24    48    87    95 1984,2014,2017,2018
#> 2:       4    26    31    53    84 1989,1991,1998,2001
#> 3:       4    26    37    53    84 1989,1991,2001,2011
f(DF, 5)[]
#>    nShared Site1 Site2 Site3                    Years
#> 1:       5    26    53    84 1989,1991,1998,2001,2011
#> 2:       5    31    59    67 1989,1992,1999,2002,2005
f(DF, 6)[]
#>     nShared Site1 Site2                             Years
#>  1:       6     5    13     1988,2004,2007,2010,2013,2021
#>  2:       6    11    13     1988,2004,2006,2007,2010,2020
#>  3:       7    26    31 1989,1991,1998,1999,2001,2017,...
#>  4:       6     5    32     1986,1987,1991,2007,2010,2013
#>  5:       6    31    59     1989,1992,1993,1999,2002,2005
#>  6:       6    31    67     1989,1990,1992,1999,2002,2005
#>  7:       6    59    67     1985,1989,1992,1999,2002,2005
#>  8:       6    31    68     1989,1991,1992,2002,2009,2021
#>  9:       6    10    69     1983,1987,1989,1994,1995,2013
#> 10:       6     4    84     1998,2001,2003,2005,2011,2018
#> 11:       6    31    84     1989,1990,1991,1998,2001,2005
#> 12:       6    20    87     1994,2001,2010,2014,2015,2021
#> 13:       7    24    87 1984,2002,2006,2014,2015,2017,...
#> 14:       7    58    87 1986,2004,2005,2006,2014,2017,...
#> 15:       6    68    87     1984,1994,2002,2014,2015,2021
#> 16:       6    72    87     1986,2001,2002,2004,2010,2017
#> 17:       6    31    88     1989,1992,1993,2005,2009,2021
#> 18:       6    33    88     2005,2008,2009,2011,2013,2021
#> 19:       6    82    88     1989,1997,2005,2011,2013,2021
#> 20:       6     8    89     1982,2000,2003,2006,2011,2017
#> 21:       6    24    89     1984,1996,2000,2006,2011,2017
#> 22:       6     5    92     1987,1991,1997,2004,2007,2013
#> 23:       6    21    94     1984,1985,1987,1993,1999,2020
#> 24:       6     8    97     1992,1997,2003,2006,2011,2020
#>     nShared Site1 Site2                             Years
f(DF, 8)[]
#> Empty data.table (0 rows and 4 cols): nShared,Site1,Site2,Years

Benchmarking

The benchmarking will compare this approach to @ThomasIsCoding's igraph solution. However, igraph gets very slow when selecting small values for the minimum number of shared years (1 or 2 for the OP's dataset). These are excluded from the benchmark.

The igraph solution as a function:

library(igraph)

f2 <- function(DF, shared) {
  lapply(
    DF %>%
      table() %>%
      tcrossprod() %>%
      # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
      graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
      ) %>%
      # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
      subgraph.edges(E(.)[E(.)$weight >= shared]) %>%
      # find all cliques
      cliques() %>%
      # double check if `Sites` in each clique meet the condition, using full info from `DF`
      Filter(
        \(q) {
          sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= shared
        }, .
      ) %>%
      # pick the clique that consists of the maximum number of `Sites`
      `[`(lengths(.) == max(lengths(.))),
    \(q) {
      list(
        sites = as.integer(names(q)),
        sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
      )
    }
  )
}

Timing:

# The following will run for several minutes without returning a
# result. They will be excluded from the benchmark.
# system.time(f2(DF, 1))
# system.time(f2(DF, 2))

microbenchmark::microbenchmark(
  matrix1 = f(DF, 1),
  # igraph1 = f2(DF, 1),
  matrix2 = f(DF, 2),
  # igraph2 = f2(DF, 2),
  matrix3 = f(DF, 3),
  igraph3 = f2(DF, 3),
  matrix4 = f(DF, 4),
  igraph4 = f2(DF, 4),
  matrix5 = f(DF, 5),
  igraph5 = f2(DF, 5),
  matrix6 = f(DF, 6),
  igraph6 = f2(DF, 6),
  matrix8 = f(DF, 8),
  igraph8 = f2(DF, 8),
  times = 10
)

#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  matrix1    4.6677    6.1859    6.61501    6.83370    7.0784    8.2880    10
#>  matrix2    2.4428    2.5075    2.84573    2.80790    3.1315    3.5438    10
#>  matrix3   35.9857   37.0604   43.86820   41.89480   49.0314   61.3006    10
#>  igraph3 5934.9483 6136.9761 6240.44407 6278.98705 6362.0348 6471.6932    10
#>  matrix4   11.7949   12.2703   13.64224   13.99900   14.4769   16.3989    10
#>  igraph4  184.7337  191.0718  210.06073  203.10085  218.6591  255.5529    10
#>  matrix5    6.3713    6.8209    7.23380    7.27330    7.6825    7.9681    10
#>  igraph5   26.8565   29.4616   32.98676   33.06670   35.5976   39.0243    10
#>  matrix6    6.3562    6.5861    7.14332    6.86025    7.6393    8.6200    10
#>  igraph6   10.9108   11.6408   14.39365   12.39585   14.8864   28.6385    10
#>  matrix8    1.1340    1.2327    1.33831    1.33730    1.4175    1.5043    10
#>  igraph8    2.9498    3.6359    3.95806    3.72040    4.3448    5.2509    10

The igraph approach is consistently slower, sometimes very much so.

Data

set.seed(123)

(DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
) %>%
    rowwise() %>%
    mutate(Years = list(sample(1982:2021, NYears))) %>%
    unnest(Years) %>%
    select(-NYears))

this is really clever way of flipping the question on it's head! Instead of going from the year lists to working combinations, (as me and Robert have done), it goes from working combinations to year lists (at least, that is how I understand it) — Mark, Jul 22 '23 at 09:54
if you could add the other answers to your benchmark it'd be really neat (if only to confirm that your yours is the fastest ^_^) — Mark, Jul 22 '23 at 09:56
Excellent effort with `Matrix`,+1! I think `igraph` can do it and you can check my answer. — ThomasIsCoding, Jul 22 '23 at 22:05
Very nice benchmarking, impressive! The reason that `igraph` is significantly slower especially when number of shared years is small is that `clique` needs enumerate more. So `clique` is the bottleneck in my `igraph` approach. — ThomasIsCoding, Jul 24 '23 at 20:24
Yes. That's why I got stuck with `igraph`--I was trying to avoid the combinatorial explosion that the other solutions run into. The iterative approach helps mitigate it because it trims at each iteration. — jblood94, Jul 24 '23 at 20:32
One comment to your solution: You need the prior info about "how many shared years it would be" before you can find the maximal number of `Sites` that can be grouped. For example, given the objective `>=5`, you need to try `f(DF, 5)`, `f(DF,6)`, `f(DF,7)` ... so on and so forth, then you find that `f(DF,5)` gives the desired output. However, what if you have no prior info but there indeed exists some cases like `f(DF, 10)` or `f(DF, k)` with a even higher `k`? Do you need to try out **all** `f(DF,k)` for `k>=5` and then check which gives the max number of sites? — ThomasIsCoding, Jul 24 '23 at 20:40
Not sure I follow. If the objective is `>=5`, `f(DF, 5)` will give the correct result for the question "what is the maximum number of sites that share at least 5 years of data". The answer is 3, and `f(DF, 5)` shows the two sets of 3 sites that answer the question. No need to check `f(DF, 6)`. But notice that `f(DF, 6)` returns sets (of size 2) of sites that share 7 years of data. The answer to the corresponding question is `2` even though some sets share 7 years of data instead of the minimum of 6. — jblood94, Jul 24 '23 at 20:50
Actually I am confused by the presentation of `f(DF, 6)` output since `nShared` appears to be `6` and `7`. If your function `f(DF, k)` aims to enumerate all possible combinations for `>=k` (not `==k`), I think the output `f(DF, 6)` should be a subset of `f(DF, 5)` since the constraint `>=6` is a subset of `>=5`. — ThomasIsCoding, Jul 24 '23 at 21:21
The question is about the *maximum* number of sites that share *at least* `N` years of data. For, `N=5`, that number is `3`, whereas for `N=6` (and `N=7`), that number is `2`. There are no three sites that share more than 5 years of data. However, the answer for both `N=6` and `N=7` is `2`, which is why `f(DF, 7)` is a subset of `f(DF, 6)`. If you check the output of `f2(DF, 6)` (the function using `igraph`), you'll see that it is the same as `f(DF, 6)` (just in a different format). — jblood94, Jul 25 '23 at 00:52
I'm selecting this response as the accepted response mainly due to its efficiency. This method can solve my dataset (2.5K sites, 40 years, 7K rows) for 8 shared years in less than 30 seconds on my PC (i5-8400 CPU @ 2.80GH, 32Gb Ram). The IGraph approach couldn't reach a solution for those 8 shared years overnight in MacStudio with a M1 chip. A solution for 10 shared years was reached in over 30 minutes. Thank you all! — mferreira, Jul 26 '23 at 09:30
I implemented some base R functions with loops, which should work better than `igraph` when with smaller `k` in `f(DF, k)`. https://stackoverflow.com/a/76775485/12158757 — ThomasIsCoding, Jul 26 '23 at 22:51

Mark · Answer 3 · 2023-07-25T12:40:15.500

7

Here is a naïve solution, which, while it probably won't be feasible for your larger dataset, could be a good starting point for finding a better solution:

library(tidyverse)

set.seed(123)

# I create the dataset myself, because I don't want it to be unnested
DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sort(sample(1982:2021, NYears)))) # sorting the years is good for later when I want to find the combinations, I can be sure that they will be in the same order

# basically, we're doing a crossjoin, filtering to overlaps larger than 5, then generating all possible combinations of those overlaps
overlaps <- cross_join(DF, DF) %>%
  filter(Sites.x < Sites.y) %>%
  mutate(Overlap = list(intersect(Years.x, Years.y))) %>%
  filter(length(Overlap) >= 5) %>%
  mutate(combinations = list(combn(Overlap, 5, simplify = FALSE))) %>%
  select(combinations, Sites.x, Sites.y) %>% 
  unnest(combinations)

most_common_fives <- overlaps %>%
  count(combinations) %>%
  slice_max(n) %>%
  pull(combinations)

overlaps %>%
    filter(combinations %in% most_common_fives) %>%
    group_by(combinations) %>%
    summarise(values = (list(unique(c(Sites.x, c(Sites.y)))))) %>%
    pull(combinations, values) 

$`c(26, 53, 84)`
[1] 1989 1991 1998 2001 2011

$`c(31, 59, 67)`
[1] 1989 1992 1999 2002 2005

edited Jul 25 '23 at 12:40

answered Jul 21 '23 at 13:28

Mark

7,785
2
14
34

1

I didn't go through the code but the result doesn't seem to be correct. With `sites <- c(9, 26, 29, 40, 44, 57, 76, 85, 91, 93, 99); x <- subset(DF, Sites %in% sites); max(table(x$Years))` we only get a maximum of 4 occurrences of one year among the 11 sites. – Robert Hacken Jul 21 '23 at 15:38
oops you're right, I must have accidentally run the first bit more than once fucking up the seed! I updated it with the proper answer – Mark Jul 21 '23 at 23:35
though your code `max(table(x$Years))` doesn't run @RobertHacken – Mark Jul 22 '23 at 00:08
and there were some issues with the second bit- the first bit was getting the right answer, it was the left join which was getting the wrong rows for some strange reason – Mark Jul 22 '23 at 01:43
does it work for you now @RobertHacken ? – Mark Jul 22 '23 at 02:44
1

Yes, now it does (+1). I wouldn't say your solution is naïve, though.. Have you seen mine? :D – Robert Hacken Jul 22 '23 at 09:50
thank you :-) I'm still trying to figure out why your one isn't faster (it is though, far and away the most concise) – Mark Jul 22 '23 at 09:55
Well, going through all the hundreds of thousands of possible combinations is not the fastest thing to do ;-) Anyway, I did some optimizations so now it is much faster (but still probably slower than yours and @jblood94's approach). – Robert Hacken Jul 22 '23 at 13:41
1

`cross_join` is a power technique to find all possible associations, +1! The only concern by this approach might be the memory efficiency when working with large dataset. – ThomasIsCoding Jul 22 '23 at 21:53
1

Thank you so much! Your's solution is the one I can more easly grasp. Cheers – mferreira Jul 24 '23 at 12:49

Robert Hacken · Answer 4 · 2023-07-22T13:37:06.353

Here is a (somewhat) optimized version of my original brute force approach which now doesn't go through all possible combinations of years but instead iteratively builds lists of sites sampled in n+1 years.

yr.sites <- tapply(DF$Sites, DF$Years, identity, simplify=F)
n.yrs <- length(yr.sites)
# year indices of each combination are stored in an attribute 'yrs'
for (i in seq_len(n.yrs)) attr(yr.sites[[i]], 'yrs') <- i
  
add_year <- function(ycs) {
  ycs <- lapply(ycs, \(sites) {
    yrs.comb <- attr(sites, 'yrs')
    last.yr <- tail(yrs.comb, 1)
    if (last.yr < n.yrs) { 
      # find subsets of sites sampled also in another year (for all years 
      #   after the last year already considered)
      lapply(seq(last.yr+1, n.yrs), \(yr) {
        inter <- intersect(sites, yr.sites[[yr]])
        # keep only intersections containing two or more sites
        if (length(inter) > 1) {
          attr(inter, 'yrs') <- c(yrs.comb, yr)
          inter
        } else NULL
      })
    } else NULL
  })
  ycs <- unlist(ycs, recursive=F)
  ycs[lengths(ycs) > 0]
}

yrs.comb.sites <- yr.sites
for (i in 2:5) {
  yrs.comb.sites <- add_year(yrs.comb.sites)
}

Still takes about a second to compute so the other answers are probably more effective, but significantly reduces the number of combinations visited compared to the brute force approach.

And the winners are:

best <- unname(yrs.comb.sites[lengths(yrs.comb.sites)==max(lengths(yrs.comb.sites))])
for (i in seq_along(best)) {
  attr(best[[i]], 'yrs') <- as.numeric(names(yr.sites))[attr(best[[i]], 'yrs')]
}
best
# [[1]]
# [1] 26 53 84
# attr(,"yrs")
# [1] 1989 1991 1998 2001 2011
# 
# [[2]]
# [1] 31 59 67
# attr(,"yrs")
# [1] 1989 1992 1999 2002 2005

Original answer

This is a very slow brute force approach and both @Mark and @jblood94 provided a much more effective solution. On my machine (not a powerful one, to be honest) this takes about 80 seconds to compute.

yr.site <- tapply(DF$Sites, DF$Years, identity, simplify=F)
year.comb.sites <- combn(yr.site, 5, Reduce, f=intersect, simplify=F)
max.groups <- year.comb.sites[lengths(year.comb.sites) == max(lengths(year.comb.sites))]
max.groups
# [[1]]
# [1] 26 53 84
# 
# [[2]]
# [1] 31 59 67

The code is short but there is a lot of computation behind it. yr.sites is a list which stores sites sampled in each year. The combn line generates all combinations of years of length five (there are 658008 such combinations) and for each combination finds the sites which were sampled in all of these years. In the end, there are 2 groups of a maximum of three sites which satisfy the initial condition.

And we can confirm this with

sort(table(subset(DF, Sites %in% max.groups[[1]])$Years))
# 1984 1988 1990 1993 1994 1996 1999 2000 2003 2005 2010 2017 2018 2021 1989 1991 1998 2001 2011 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    3    3    3    3    3

which shows that indeed all three sites in, e.g., the first group were sampled in 1989, 1991, 1998, 2001 and 2011.

As I already said, this solution is not effective and wouldn't scale well if you wanted to ask for more than 5 shared years. E.g., in the case of 6 years, the number of combinations (choose(40, 6)) would increase to 3838380 and a higher number of years would not be good news either.

score 5 · Answer 5 · answered Jul 25 '23 at 09:38

Follow-up to the `igraph` solution: Break Down Codes

I don't want my igraph answer too long to read, so I created a new answer to elaborate how the that solution works.

Dummy Example

For simplicity, we can start with a smaller dummy example, and we assume that our objective is to find the maximal number of Sites that share at least 3 years.

set.seed(0)
DF <- data.frame(
    Sites = rep(1:10, each = 10),
    Years = sample(2000:2020, 100, replace = TRUE)
) %>%
    unique()

and the visualization of DF in a graph looks like can be done by

DF %>%
    graph_from_data_frame() %>%
    set_vertex_attr(name = "color", value = names(V(.)) %in% DF$Sites) %>%
    plot()

Steps

1). As we can see, the graph can be treated as a bipartite graph (Sites and Years are two different types of vertices) and the shared Years actually can be projected to edges. Since we need to keep track of the number of shared Years, we can use an edge attribute "weight" to identify counts of shared Years. In that case, an adjacency matrix of Sites is required before the projection and it is obtained by table + tcrossprod, e.g.,

adjmat <- DF %>%
    table() %>%
    tcrossprod()

which gives

> adjmat
     Sites
Sites 1 2 3 4 5 6 7 8 9 10
   1  8 2 3 0 4 4 4 2 5  3
   2  2 7 5 2 3 1 2 3 2  4
   3  3 5 9 4 4 3 2 3 2  4
   4  0 2 4 6 3 2 1 2 1  3
   5  4 3 4 3 9 5 4 4 4  4
   6  4 1 3 2 5 7 2 3 2  3
   7  4 2 2 1 4 2 9 4 3  4
   8  2 3 3 2 4 3 4 9 2  5
   9  5 2 2 1 4 2 3 2 7  4
   10 3 4 4 3 4 3 4 5 4  9

2). As stated in the objective, we want to find out the groups that have >=3 shared years, that means, the "weight" of edges should be at least 3. On top of the obtained adjacency matrix adjmat in step 1), we can further apply a filter (>=3) to simplify the matrix, which is equivalent to pruning the network, i.e.,

g <- adjmat %>%
    `>=`(3) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE)

and plot(g) shows the project like below

3). We have known the fact that, the Sites that share the same Years should produce a complete subgraph, namely, clique. Thus, we hereby can enumerate all cliques in graph g, which is done by the function cliques(), i.e.,

clq <- g %>%
    cliques(min = 2)

where min = 2 specifies the minimal size of each cliques. If you have no prior information about the size, it is also fine to use cliques() without any additional arguments. Now, clq is a list which looks like

> clq 
[[1]]
+ 2/10 vertices, named, from 3b71441:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 5

[[3]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  5  10

[[4]]
+ 2/10 vertices, named, from 3b71441:
[1] 3  10

[[5]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 7

[[6]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  7  10

[[7]]
+ 2/10 vertices, named, from 3b71441:
[1] 7  10

[[8]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 8

[[9]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 8

[[10]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  8  10

[[11]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  8  10

[[12]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 8

[[13]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 8

[[14]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  8  10

[[15]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  8  10

[[16]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 8

[[17]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  8  10

[[18]]
+ 2/10 vertices, named, from 3b71441:
[1] 8  10

[[19]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 7

[[20]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 7

[[21]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  7  10

[[22]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  7  10

[[23]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 3

[[24]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 5

[[25]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  5  10

[[26]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  3  10

[[27]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 5

[[28]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  5  10

[[29]]
+ 2/10 vertices, named, from 3b71441:
[1] 1  10

[[30]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 4

[[31]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 5

[[32]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  4  5  10

[[33]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  4  10

[[34]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 5

[[35]]
+ 3/10 vertices, named, from 3b71441:
[1] 4  5  10

[[36]]
+ 2/10 vertices, named, from 3b71441:
[1] 4  10

[[37]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 9

[[38]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 9

[[39]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 9

[[40]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  5  7  9  10

[[41]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  7  9  10

[[42]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 9

[[43]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  9  10

[[44]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  9  10

[[45]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 9

[[46]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 9

[[47]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  9  10

[[48]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  9  10

[[49]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 9

[[50]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  9  10

[[51]]
+ 2/10 vertices, named, from 3b71441:
[1] 9  10

[[52]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 6

[[53]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 6

[[54]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 6

[[55]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  3  5  6  10

[[56]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  6  10

[[57]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 6

[[58]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  6  10

[[59]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  6  10

[[60]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 8

[[61]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 8

[[62]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 8

[[63]]
+ 5/10 vertices, named, from 3b71441:
[1] 3  5  6  8  10

[[64]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  6  8  10

[[65]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 8

[[66]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  6  8  10

[[67]]
+ 3/10 vertices, named, from 3b71441:
[1] 6  8  10

[[68]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 6

[[69]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 6

[[70]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  6  10

[[71]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  6  10

[[72]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 6

[[73]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  6  10

[[74]]
+ 2/10 vertices, named, from 3b71441:
[1] 6  10

[[75]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 8

[[76]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 8

[[77]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 8

[[78]]
+ 5/10 vertices, named, from 3b71441:
[1] 2  3  5  8  10

[[79]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  8  10

[[80]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 8

[[81]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  5  8  10

[[82]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  8  10

[[83]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 3

[[84]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 5

[[85]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  5  10

[[86]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  3  10

[[87]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 5

[[88]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  5  10

[[89]]
+ 2/10 vertices, named, from 3b71441:
[1] 2  10

4). It should be noted that, not all of the cliques meet >=3 requirement in terms of shared Years since the "weight" indicates the count of total shared Years, instead of the count of distinguished shared Years. In other words, 2000, 2001, 2002 is valid but 2000, 2000, 2002 is not, although the latter has a count of 3. We therefore need to check the distribution of shared Years in each clique.

For example, if we look into the third clique, i.e., clq[[3]], and check the distribution of shared Years

> q <- clq[[3]]

> table(subset(DF, Sites %in% names(q)))
     Years
Sites 2000 2001 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
   3     1    1    1    0    0    0    1    1    0    1    0    1    1    0
   5     0    1    0    1    1    1    0    1    0    1    1    1    0    0
   10    1    0    0    0    1    1    0    1    1    1    0    0    1    1
     Years
Sites 2018 2019 2020
   3     0    1    0
   5     1    0    0
   10    0    0    1

we see that columns with all 1s are 2009 and 2011, which means they have only 2 shared Years and thus invalid. To pick the valid cliques, we can use Filter with a filtering criteria, e.g.,

validclq <- clq %>%
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 3
        }, .
    )

and we will see

> validclq
[[1]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 5

[[3]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3  10

[[4]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 7

[[5]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7  10

[[6]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 8

[[7]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 8

[[8]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 8

[[9]]
+ 2/10 vertices, named, from 9bd9430:
[1] 8  10

[[10]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 7

[[11]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 3

[[12]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 5

[[13]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1  10

[[14]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 4

[[15]]
+ 3/10 vertices, named, from 9bd9430:
[1] 3  4  10

[[16]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 5

[[17]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4  10

[[18]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 9

[[19]]
+ 3/10 vertices, named, from 9bd9430:
[1] 1 5 9

[[20]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 9

[[21]]
+ 3/10 vertices, named, from 9bd9430:
[1] 7  9  10

[[22]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 9

[[23]]
+ 2/10 vertices, named, from 9bd9430:
[1] 9  10

[[24]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 6

[[25]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 8

[[26]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 6

[[27]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 6

[[28]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6  10

[[29]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 8

[[30]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 3

[[31]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 5

[[32]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2  10

where 32 out of 89 cliques meets the requirement.

5). As the last step, we select the cliques that have the largest size (since we are looking for the maximal number of Sites) from the valid cliques validclq. Using lenghts to obtain the size of each clique, we can filter the ones with the largest size, e.g.,

res <- validclq %>%
    `[`(lengths(.) == max(lengths(.)))

and we finally obtained

[[1]]
+ 3/10 vertices, named, from 75803c6:
[1] 3  4  10

[[2]]
+ 3/10 vertices, named, from 75803c6:
[1] 1 5 9

[[3]]
+ 3/10 vertices, named, from 75803c6:
[1] 7  9  10

It is a very elegant approach, but in the end not very effective to solve my problem. But I can see other uses for it. Thank you very much for sharing. Cheers — mferreira, Jul 26 '23 at 10:57
@mferreira I added a new solution which are implemented by base R with loops. A benchmarking is added as well for reference. https://stackoverflow.com/a/76775485/12158757 — ThomasIsCoding, Jul 26 '23 at 22:43

ThomasIsCoding · Answer 6 · 2023-08-03T10:05:18.790

Here are two base R options using the Dynamic Programming technique, which search the desired output through a "tree", and the "tree" can be expanded when a new feasible Sites is added.

Note: If you would like to significantly boost the performance of this approach, you probably can use Rcpp to have a translation from base R to Cpp. Unfortunately I am not familiar with Rcpp but hope someone can contribute a Cpp version someday in the future.

Base R Implementations

Option 1 ftic1, which checker retrieves the rows from the matrix and checks if the number of shared years meet the requirement

ftic1 <- function(DF, kmin) {
    # create a `Sites`-to-`Years` table
    M <- table(DF)
    nr <- nrow(M)

    # helper function that checks the validity
    checker <- function(idx, kmin) {
        sum(colSums(M[idx, , drop = FALSE]) == length(idx)) >= kmin
    }

    # start from each site
    res <- as.list(seq(nr))
    repeat {
        lst <- list()
        for (l in res) {
            u <- unlist(l)
            uend <- u[length(u)]
            if (uend < nr) {
                for (p in (uend + 1):nr) {
                    ks <- c(u, p)
                    if (checker(ks, kmin)) {
                        lst[length(lst) + 1] <- list(ks)
                    }
                }
            }
        }
        if (length(lst) == 0) {
            break
        } else {
            res <- lst
        }
    }
    res
}

Option 2 ftic2, where we applied a mask msk to boost the checker since it only needs current mask and a new row for the checking (since colSums used redundant info over and over again, which results in poor efficiency). Thus, ftic2 should be more performant than ftic1, although it looks less neat!

ftic2 <- function(DF, kmin) {
    # create a `Sites`-to-`Years` table
    M <- table(DF)
    nr <- nrow(M)

    # helper function that checks the validity
    checker <- function(msk, v, kmin) {
        sum(msk & v) >= kmin
    }

    # start from each site
    res <- lapply(1:nr, \(x) list(idx = x, msk = M[x, ]))
    repeat {
        lst <- list()
        for (l in res) {
            u <- l$idx
            msk <- l$msk
            uend <- u[length(u)]
            if (uend < nr) {
                for (p in (uend + 1):nr) {
                    v <- M[p, ]
                    if (checker(msk, v, kmin)) {
                        lst[[length(lst) + 1]] <- list(idx = c(u, p), msk = msk & v)
                    }
                }
            }
        }
        if (length(lst) == 0) {
            break
        } else {
            res <- lst
        }
    }
    res
}

Example Output

> ftic1(DF, 5)
[[1]]
[1] 26 53 84

[[2]]
[1] 31 59 67


> ftic2(DF, 5)
[[1]]
[[1]]$idx
[1] 26 53 84

[[1]]$msk
 1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2021
FALSE


[[2]]
[[2]]$idx
[1] 31 59 67

[[2]]$msk
 1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
 2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2021
FALSE

As we can see, in the result by ftic2, the msk field indicates which Years are shared by the Sites indices provided by the idx field. In this sense, ftic2 is more handy than ftic1 when you intend to fetch the information of shared Years.

Benchmarking

Here we take @jblood94's solution as the baseline, since that solution is the most efficient so far.

>=3

microbenchmark(
    ftic1(DF, 3),
    ftic2(DF, 3),
    f(DF, 3)[],
    times = 20L,
    unit = "relative"
)
Unit: relative
         expr      min       lq      mean   median        uq       max neval
 ftic1(DF, 3) 22.44911 21.49605 19.464670 21.41181 19.943792 10.261857    20
 ftic2(DF, 3) 10.68673 10.27488  9.004385 10.18160  9.215856  3.757149    20
   f(DF, 3)[]  1.00000  1.00000  1.000000  1.00000  1.000000  1.000000    20

>=4

microbenchmark(
    ftic1(DF, 4),
    ftic2(DF, 4),
    f(DF, 4)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr       min       lq      mean    median        uq      max neval
 ftic1(DF, 4) 20.135953 19.60166 15.052466 19.268706 17.753250 4.583222    20
 ftic2(DF, 4)  9.843199  9.47094  7.098557  9.405802  8.107661 1.985759    20
   f(DF, 4)[]  1.000000  1.00000  1.000000  1.000000  1.000000 1.000000    20

>=5

microbenchmark(
    ftic1(DF, 5),
    ftic2(DF, 5),
    f(DF, 5)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr       min        lq     mean    median        uq       max neval
 ftic1(DF, 5) 16.953898 16.060962 9.512281 15.488721 13.738142 1.5683931    20
 ftic2(DF, 5)  7.793642  7.567225 4.546013  7.409521  6.658683 0.7780443    20
   f(DF, 5)[]  1.000000  1.000000 1.000000  1.000000  1.000000 1.0000000    20

>=6

microbenchmark(
    ftic1(DF, 6),
    ftic2(DF, 6),
    f(DF, 6)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr      min       lq     mean    median       uq       max neval
 ftic1(DF, 6) 9.031968 9.819239 5.744555 10.091242 9.760496 1.0847252    20
 ftic2(DF, 6) 4.417896 4.552029 2.721448  4.861791 4.782947 0.4812401    20
   f(DF, 6)[] 1.000000 1.000000 1.000000  1.000000 1.000000 1.0000000    20

Selecting the maximum number of values from a column that share values in another column in R

6 Answers6

Disclaimer

Brief Idea (Graph Theory Perspective)

One igraph Approach

Discussion

Benchmarking

Data

Original answer

Follow-up to the igraph solution: Break Down Codes

Dummy Example

Steps

Base R Implementations

Example Output

Benchmarking

One `igraph` Approach

Follow-up to the `igraph` solution: Break Down Codes