Let's say I have run different tests to see if some objects are identical. The testing was done pairwise, and I have a dataframe containing the pairs of objects that are the same:

same.pairs <- data.frame(Test=c(rep(1, 4), rep(2, 6)),
                         First=c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"), 
                         Second=c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G"))

##

Test First Second
   1     A      B
   1     A      C
   1     B      C
   1     D      E
   2     A      B
   2     A      E
   2     B      E
   2     C      D
   2     C      G
   2     D      G

From this I can see that in Test 1, because A = B, A = C, and B = C, it follows that A = B = C, so these three objects belong to one set of size 3.
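The transitive reasoning above can be sketched in base R for the Test 1 pairs alone. This is only an illustration under simple assumptions: the names `first`, `second`, `label`, and `set.sizes` are made up for the example, and the relabelling loop is a naive way to merge sets rather than an efficient one.

```r
# Pairs found identical in Test 1 (copied from the data above)
first  <- c("A", "A", "B", "D")
second <- c("B", "C", "C", "E")

# Start with every object in its own set, labelled by its own name
objects <- sort(unique(c(first, second)))
label <- setNames(objects, objects)

# For each identical pair, give everything in the Second object's set
# the label of the First object's set, which merges the two sets
for (i in seq_along(first)) {
  label[label == label[[second[i]]]] <- label[[first[i]]]
}

# Number of objects sharing each label, i.e. the size of each set
set.sizes <- table(label)
set.sizes
```

Here B = C changes nothing because both objects already carry A's label, which is the step the pairwise counting misses.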

I want to know the sizes of all the sets for each test. In this example, Test 1 has one set of 3 identical objects (A, B, C) and one set of 2 (D, E), and Test 2 has two sets of size 3, (A, B, E) and (C, D, G). I don't need to know which objects are in each set, just the set sizes and how many sets have each size:

Test ReplicateSize Count
   1             3     1
   1             2     1
   2             3     2

Is there an elegant way to do this? I thought I had it with this:

library(dplyr)

sets <- same.pairs %>%
  group_by(Test, First) %>%
  summarize(ReplicateSize=n()) %>%
  # n() only counts the Second objects paired with each First, so add 1 to include the First itself
  mutate(ReplicateSize=ReplicateSize+1) %>%
  select(-First) %>%
  ungroup() %>%
  group_by(Test, ReplicateSize) %>%
  summarize(Count=n()) %>%
  arrange(Test, ReplicateSize)

##

Test ReplicateSize Count
   1             2     2
   1             3     1
   2             2     2
   2             3     2

but this double counts some of the sets. For example, in Test 1 the pair B and C is counted as a separate set of size 2, when it should be ignored because both objects already belong to a set with A. I'm not sure how to skip rows where the First object has already been observed as a Second object without writing a complicated for loop.

Any guidance appreciated.

hmg
  • https://stackoverflow.com/questions/30407769/get-connected-components-using-igraph-in-r – rawr Mar 04 '21 at 18:34
  • Yes, this looks like a network analysis problem where `igraph` or `tidygraph` could do the heavy lifting of identifying linked clusters. – Jon Spring Mar 04 '21 at 18:59

1 Answer

I don't fully understand what you are trying to accomplish, but your current code could be shortened to the following, which gives the same result:

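library(dplyr)

same.pairs %>%
  # number of rows per First object within each test
  count(Test, First, name = "ReplicateSize") %>% 
  # number of First objects with each row count
  count(Test, ReplicateSize, name = "Count") %>% 
  # add 1 so the size includes the First object itself
  mutate(ReplicateSize = ReplicateSize + 1)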

  Test ReplicateSize Count
1    1             2     2
2    1             3     1
3    2             2     2
4    2             3     2
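
For the desired output described in the question, one possible approach follows the `igraph` suggestion from the comments: treat each pair as an undirected edge and take the sizes of the connected components per test. This is a minimal sketch, not a tested solution for real data. It assumes the `igraph` package is installed, and `component_sizes`, `sizes_by_test`, and `set.counts` are hypothetical names introduced for the example.

```r
library(igraph)

same.pairs <- data.frame(
  Test   = c(rep(1, 4), rep(2, 6)),
  First  = c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"),
  Second = c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G")
)

# Hypothetical helper: build an undirected graph from one test's pairs
# and return the size of each connected component (each set of identical objects)
component_sizes <- function(pairs) {
  g <- graph_from_data_frame(pairs[, c("First", "Second")], directed = FALSE)
  components(g)$csize
}

# Component sizes computed separately for each test
sizes_by_test <- lapply(split(same.pairs, same.pairs$Test), component_sizes)

# Tabulate how many sets of each size occur in each test
set.counts <- do.call(rbind, lapply(names(sizes_by_test), function(test) {
  tab <- table(sizes_by_test[[test]])
  data.frame(Test = as.numeric(test),
             ReplicateSize = as.integer(names(tab)),
             Count = as.integer(tab))
}))
set.counts
```

On the example data this should give one set of size 2 and one of size 3 for Test 1, and two sets of size 3 for Test 2. Objects that never appear in any pair are not in the graph, so sets of size 1 are not counted.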
Lennyy
  • Thanks for this streamlined code! I've added more detail to my question and an example of desired output. – hmg Mar 04 '21 at 19:36