This is the toy version of my real dataframe:
df <- data.frame(
sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s3", "s4"),
snp = c("snp1", "snp1", "snp1", "snp1", "snp1", "snp1", "snp2", "snp2", "snp2"),
random_column = 1:9
)
I'm interested in counting the number of unique sample-snp pairs for each snp and returning that value on every row. In this case, s1 and s2 have snp1, so size
should be 2 for all the duplicate rows (1-6), and s1, s3 and s4 have snp2, so size
should be 3 for rows 7-9. This would be the expected output:
  sample random_column   snp  size
   (chr)         (int) (chr) (int)
1     s1             1  snp1     2
2     s1             2  snp1     2
3     s1             3  snp1     2
4     s2             4  snp1     2
5     s2             5  snp1     2
6     s2             6  snp1     2
7     s1             7  snp2     3
8     s3             8  snp2     3
9     s4             9  snp2     3
I guess I could do this and then some type of left-join, but I'm wondering if there is an easier way:
library(dplyr)
# keep one row per sample-snp pair, then count the rows per snp
df[!duplicated(df[, c("sample", "snp")]), ] %>% group_by(snp) %>% summarize(size = n())
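For reference, the left-join route mentioned above could look roughly like the sketch below. It assumes a dplyr version recent enough for `count()` to accept the `name` argument. The names `sizes` and `result` are placeholders of mine, not part of the real data.

```r
library(dplyr)

df <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s3", "s4"),
  snp = c("snp1", "snp1", "snp1", "snp1", "snp1", "snp1", "snp2", "snp2", "snp2"),
  random_column = 1:9
)

# One row per snp, holding the number of distinct samples that carry it
sizes <- df %>%
  distinct(sample, snp) %>%
  count(snp, name = "size")

# Attach the per-snp size back onto every original row
result <- left_join(df, sizes, by = "snp")
```

This works, but it needs an intermediate table and a join just to add one column, which is why I suspect there is something more direct.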