How to get the number of unique rows (on a column subset) in a dataframe using dplyr?

Question

This is the toy version of my real dataframe:

df <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s3", "s4"),
  snp = c("snp1", "snp1", "snp1", "snp1", "snp1", "snp1", "snp2", "snp2", "snp2"),
  random_column = 1:9
)

I want to count the number of unique sample-snp pairs for each snp and return that value on every row. In this case, s1 and s2 have snp1, so size should be 2 for rows 1-6 (which contain duplicate pairs), and s1, s3 and s4 have snp2, so size should be 3 for rows 7-9. This is the expected output:

  sample random   snp  size
   (chr)  (int) (chr) (int)
1     s1      1  snp1     2
2     s1      2  snp1     2
3     s1      3  snp1     2
4     s2      4  snp1     2
5     s2      5  snp1     2
6     s2      6  snp1     2
7     s1      7  snp2     3
8     s3      8  snp2     3
9     s4      9  snp2     3

I guess I could compute the per-snp counts like this and then left-join them back onto df, but I'm wondering if there is an easier way:

df[!duplicated(df[,c('sample','snp')]),] %>% group_by(snp) %>% summarize(size = n())
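
For reference, the dedupe-then-join idea above could be completed roughly as follows. This is only a sketch, assuming dplyr is attached and a version that supports `count(name = )`; `pair_counts` and `result` are names introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s3", "s4"),
  snp = c("snp1", "snp1", "snp1", "snp1", "snp1", "snp1", "snp2", "snp2", "snp2"),
  random_column = 1:9
)

# Keep one row per unique sample-snp pair, then count the pairs per snp
pair_counts <- df %>%
  distinct(sample, snp) %>%
  count(snp, name = "size")

# Attach the per-snp count back to every original row
result <- left_join(df, pair_counts, by = "snp")
```

This works, but it needs an intermediate table and a join just to add one column.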

Do you mean `df %>% group_by(snp) %>% mutate(size = n_distinct(sample))`? — talat, Jun 20 '16 at 19:48
Brilliant! I can't believe I missed that. You can add it as the answer if you want — nachocab, Jun 20 '16 at 20:19
It's a duplicate (though I don't have the right dupe target at hand atm). Feel free to answer it yourself and accept it if you like — talat, Jun 20 '16 at 20:27
@Rockbar, not exactly, since `aggregate` _aggregates_ the results (i.e. returns 1 row per group) whereas OP just wants to create a new column (or you would need to join the aggregated results back to the original data) A more direct base R equivalent is therefore `ave`. — talat, Jun 20 '16 at 20:32
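
To make the two suggestions in the comments concrete, here is a minimal sketch run against the toy data from the question. The names `with_dplyr` and `with_base` are introduced here for illustration. The `ave` variant is one possible way to write the base R equivalent mentioned above, not necessarily what the commenter had in mind.

```r
library(dplyr)

df <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s3", "s4"),
  snp = c("snp1", "snp1", "snp1", "snp1", "snp1", "snp1", "snp2", "snp2", "snp2"),
  random_column = 1:9
)

# dplyr: mutate() keeps every row; n_distinct() counts unique samples per snp
with_dplyr <- df %>%
  group_by(snp) %>%
  mutate(size = n_distinct(sample)) %>%
  ungroup()

# Base R: ave() applies FUN within each snp group and returns a vector as
# long as its input. The samples are converted to integer codes first,
# because ave() coerces the result to the type of its first argument.
with_base <- df
with_base$size <- ave(
  as.integer(factor(df$sample)),
  df$snp,
  FUN = function(x) length(unique(x))
)
```

Both variants add the count as a new column without collapsing the data to one row per group, which is the difference from `aggregate` or `summarize` pointed out in the comment above.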
