Follow-up to the igraph
solution: Break Down Codes
I don't want my igraph
answer too long to read, so I created a new answer to elaborate how the that solution works.
Dummy Example
For simplicity, we can start with a smaller dummy example, and we assume that our objective is to find the maximal number of Sites
that share at least 3
years.
set.seed(0)
DF <- data.frame(
Sites = rep(1:10, each = 10),
Years = sample(2000:2020, 100, replace = TRUE)
) %>%
unique()
and the visualization of DF
in a graph looks like can be done by
DF %>%
graph_from_data_frame() %>%
set_vertex_attr(name = "color", value = names(V(.)) %in% DF$Sites) %>%
plot()

Steps
1). As we can see, the graph can be treated as a bipartite graph (Sites
and Years
are two different types of vertices) and the shared Years
actually can be projected to edges. Since we need to keep track of the number of shared Years
, we can use an edge attribute "weight"
to identify counts of shared Years
. In that case, an adjacency matrix of Sites
is required before the projection and it is obtained by table
+ tcrossprod
, e.g.,
adjmat <- DF %>%
table() %>%
tcrossprod()
which gives
> adjmat
Sites
Sites 1 2 3 4 5 6 7 8 9 10
1 8 2 3 0 4 4 4 2 5 3
2 2 7 5 2 3 1 2 3 2 4
3 3 5 9 4 4 3 2 3 2 4
4 0 2 4 6 3 2 1 2 1 3
5 4 3 4 3 9 5 4 4 4 4
6 4 1 3 2 5 7 2 3 2 3
7 4 2 2 1 4 2 9 4 3 4
8 2 3 3 2 4 3 4 9 2 5
9 5 2 2 1 4 2 3 2 7 4
10 3 4 4 3 4 3 4 5 4 9
2). As stated in the objective, we want to find out the groups that have >=3
shared years, that means, the "weight"
of edges should be at least 3
. On top of the obtained adjacency matrix adjmat
in step 1), we can further apply a filter (>=3
) to simplify the matrix, which is equivalent to pruning the network, i.e.,
g <- adjmat %>%
`>=`(3) %>%
graph_from_adjacency_matrix(mode = "undirected", diag = FALSE)
and plot(g)
shows the project like below

3). We have known the fact that, the Sites
that share the same Years
should produce a complete subgraph, namely, clique. Thus, we hereby can enumerate all cliques in graph g
, which is done by the function cliques()
, i.e.,
clq <- g %>%
cliques(min = 2)
where min = 2
specifies the minimal size of each cliques. If you have no prior information about the size, it is also fine to use cliques()
without any additional arguments. Now, clq
is a list which looks like
> clq
[[1]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 10
[[2]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 5
[[3]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 10
[[4]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 10
[[5]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 7
[[6]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 10
[[7]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 10
[[8]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 8
[[9]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 8
[[10]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 7 8 10
[[11]]
+ 3/10 vertices, named, from 3b71441:
[1] 7 8 10
[[12]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 8
[[13]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 8
[[14]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 8 10
[[15]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 8 10
[[16]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 8
[[17]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 8 10
[[18]]
+ 2/10 vertices, named, from 3b71441:
[1] 8 10
[[19]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 7
[[20]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 7
[[21]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 10
[[22]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 10
[[23]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 3
[[24]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 5
[[25]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 10
[[26]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 10
[[27]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 5
[[28]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 10
[[29]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 10
[[30]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 4
[[31]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 5
[[32]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 4 5 10
[[33]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 10
[[34]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 5
[[35]]
+ 3/10 vertices, named, from 3b71441:
[1] 4 5 10
[[36]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 10
[[37]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 9
[[38]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 9
[[39]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 9
[[40]]
+ 5/10 vertices, named, from 3b71441:
[1] 1 5 7 9 10
[[41]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 7 9 10
[[42]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 9
[[43]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 9 10
[[44]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 9 10
[[45]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 9
[[46]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 9
[[47]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 7 9 10
[[48]]
+ 3/10 vertices, named, from 3b71441:
[1] 7 9 10
[[49]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 9
[[50]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 9 10
[[51]]
+ 2/10 vertices, named, from 3b71441:
[1] 9 10
[[52]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 6
[[53]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 6
[[54]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 6
[[55]]
+ 5/10 vertices, named, from 3b71441:
[1] 1 3 5 6 10
[[56]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 6 10
[[57]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 6
[[58]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 6 10
[[59]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 6 10
[[60]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 8
[[61]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 8
[[62]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 8
[[63]]
+ 5/10 vertices, named, from 3b71441:
[1] 3 5 6 8 10
[[64]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 6 8 10
[[65]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 8
[[66]]
+ 4/10 vertices, named, from 3b71441:
[1] 5 6 8 10
[[67]]
+ 3/10 vertices, named, from 3b71441:
[1] 6 8 10
[[68]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 6
[[69]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 6
[[70]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 10
[[71]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 10
[[72]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 6
[[73]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 10
[[74]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 10
[[75]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 8
[[76]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 8
[[77]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 8
[[78]]
+ 5/10 vertices, named, from 3b71441:
[1] 2 3 5 8 10
[[79]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 8 10
[[80]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 8
[[81]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 5 8 10
[[82]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 8 10
[[83]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 3
[[84]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 5
[[85]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 10
[[86]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 10
[[87]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 5
[[88]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 10
[[89]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 10
4). It should be noted that, not all of the cliques meet >=3
requirement in terms of shared Years
since the "weight" indicates the count of total shared Years
, instead of the count of distinguished shared Years
. In other words, 2000, 2001, 2002
is valid but 2000, 2000, 2002
is not, although the latter has a count of 3
. We therefore need to check the distribution of shared Years
in each clique.
For example, if we look into the third clique, i.e., clq[[3]]
, and check the distribution of shared Years
> q <- clq[[3]]
> table(subset(DF, Sites %in% names(q)))
Years
Sites 2000 2001 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
3 1 1 1 0 0 0 1 1 0 1 0 1 1 0
5 0 1 0 1 1 1 0 1 0 1 1 1 0 0
10 1 0 0 0 1 1 0 1 1 1 0 0 1 1
Years
Sites 2018 2019 2020
3 0 1 0
5 1 0 0
10 0 0 1
we see that columns with all 1
s are 2009
and 2011
, which means they have only 2 shared Years
and thus invalid. To pick the valid cliques, we can use Filter
with a filtering criteria, e.g.,
validclq <- clq %>%
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 3
}, .
)
and we will see
> validclq
[[1]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 10
[[2]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 5
[[3]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 10
[[4]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 7
[[5]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 10
[[6]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 8
[[7]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 8
[[8]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 8
[[9]]
+ 2/10 vertices, named, from 9bd9430:
[1] 8 10
[[10]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 7
[[11]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 3
[[12]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 5
[[13]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 10
[[14]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 4
[[15]]
+ 3/10 vertices, named, from 9bd9430:
[1] 3 4 10
[[16]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 5
[[17]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 10
[[18]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 9
[[19]]
+ 3/10 vertices, named, from 9bd9430:
[1] 1 5 9
[[20]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 9
[[21]]
+ 3/10 vertices, named, from 9bd9430:
[1] 7 9 10
[[22]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 9
[[23]]
+ 2/10 vertices, named, from 9bd9430:
[1] 9 10
[[24]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 6
[[25]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 8
[[26]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 6
[[27]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 6
[[28]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 10
[[29]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 8
[[30]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 3
[[31]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 5
[[32]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 10
where 32 out of 89 cliques meets the requirement.
5). As the last step, we select the cliques that have the largest size (since we are looking for the maximal number of Sites
) from the valid cliques validclq
. Using lenghts
to obtain the size of each clique, we can filter the ones with the largest size, e.g.,
res <- validclq %>%
`[`(lengths(.) == max(lengths(.)))
and we finally obtained
[[1]]
+ 3/10 vertices, named, from 75803c6:
[1] 3 4 10
[[2]]
+ 3/10 vertices, named, from 75803c6:
[1] 1 5 9
[[3]]
+ 3/10 vertices, named, from 75803c6:
[1] 7 9 10