Let's say I have run different tests to see if some objects are identical. The testing was done pairwise, and I have a dataframe containing the pairs of objects that are the same:
same.pairs <- data.frame(Test   = c(rep(1, 4), rep(2, 6)),
                         First  = c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"),
                         Second = c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G"))
##
Test First Second
   1     A      B
   1     A      C
   1     B      C
   1     D      E
   2     A      B
   2     A      E
   2     B      E
   2     C      D
   2     C      G
   2     D      G
From this I can see that in Test 1, A = B, A = C, and B = C, so A = B = C and these three objects belong to one set of size 3.
I want to know the full size of each set in each test. For this example, Test 1 has one set of 3 identical objects (A, B, C) and one set of 2 (D, E), and Test 2 has two sets of size 3, (A, B, E) and (C, D, G). I don't need to know which objects are in each set, just the set sizes and a count of how many sets have each size:
Test ReplicateSize Count
   1             3     1
   1             2     1
   2             3     2
Is there an elegant way to do this? I thought I had it with this:
library(dplyr)

sets <- same.pairs %>%
  group_by(Test, First) %>%
  summarize(ReplicateSize = n()) %>%
  # add 1 to size because the count above only covers the Second objects; need to include First
  mutate(ReplicateSize = ReplicateSize + 1) %>%
  select(-First) %>%
  ungroup() %>%
  group_by(Test, ReplicateSize) %>%
  summarize(Count = n()) %>%
  arrange(Test, ReplicateSize)
##
Test ReplicateSize Count
   1             2     2
   1             3     1
   2             2     2
   2             3     2
but this double-counts some of the sets. For example, in Test 1 the pair B and C is counted as a separate set of size 2 instead of being ignored, even though both objects already belong to a set with A. I'm not sure how to skip rows where the First object has already appeared as a Second object without writing a complicated for loop.
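One direction I have wondered about, as a rough sketch only: treat each pair as an edge in an undirected graph, so that each set of identical objects is a connected component and its size is the component size. This assumes the igraph package is installed (it is not used anywhere above), and `set.sizes` is just an illustrative name.

```r
library(igraph)  # assumption: igraph is available; not part of my original attempt

same.pairs <- data.frame(Test   = c(rep(1, 4), rep(2, 6)),
                         First  = c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"),
                         Second = c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G"))

# For each test, build an undirected graph whose edges are the identical pairs,
# then tabulate the sizes of its connected components.
set.sizes <- do.call(rbind, lapply(split(same.pairs, same.pairs$Test), function(d) {
  g <- graph_from_data_frame(d[, c("First", "Second")], directed = FALSE)
  sizes <- table(components(g)$csize)  # names = set size, values = number of sets
  data.frame(Test          = d$Test[1],
             ReplicateSize = as.integer(names(sizes)),
             Count         = as.integer(sizes))
}))
rownames(set.sizes) <- NULL

print(set.sizes)
```

On the example data this should give one set of size 2 and one of size 3 for Test 1, and two sets of size 3 for Test 2, which matches the result I am after. I don't know whether this is the most elegant way to do it, or how well it scales to many tests.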
Any guidance appreciated.