I have a data frame with two categorical variables.

samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
  samples groups
1       A      1
2       A      1
3       A      1
4       A      2
5       B      1
6       B      1

For each sample-group combination, I would like to randomly downsample the data frame to a maximum of X rows (the randomness is important), while keeping all rows of any combination that appears fewer than X times. In this example X = 2. Is there an easy way to do this? My problem is that the combination (A, 2), row 4, appears only once, so dplyr's sample_n would not work.

Desired output:

  samples groups
1       A      1
2       A      1
3       A      2
4       B      1
5       B      1
Kaizen
  • I assume that the data.frame consists of more than 2 columns and that not all of them are used for grouping – s_baldur Oct 29 '20 at 13:29
  • Does this answer your question? [R (and dplyr?) - Sampling from a dataframe by group, up to a maximum sample size of n](https://stackoverflow.com/questions/52816423/r-and-dplyr-sampling-from-a-dataframe-by-group-up-to-a-maximum-sample-size) – camille Dec 24 '21 at 15:51

2 Answers

For each group, you can sample the minimum of the number of rows and x:

library(dplyr)

x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))

#  samples groups
#  <chr>    <dbl>
#1 A            1
#2 A            1
#3 A            2
#4 B            1
#5 B            1

However, note that sample_n() has been superseded in favor of slice_sample(), and n() doesn't work with slice_sample(). There is an open issue here for it.


However, as @tmfmnk mentioned, we don't need to call n() here. Try:

df %>% group_by(samples, groups) %>% slice_sample(n = x)
Ronak Shah
  • Not sure whether it produces a consistent solution across all scenarios, but according to the documentation of `slice_sample()`: `If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size. If the proportion of a group size is not an integer, it is rounded down.` So here, using `slice_sample(n = 2)` produces the same results as your code. – tmfmnk Oct 29 '20 at 11:07
  • @tmfmnk Thanks, I somehow missed that in the documentation. I have updated the answer based on your suggestion. – Ronak Shah Oct 29 '20 at 11:47
  • Thank you guys, `slice_sample` is perfectly working. – Kaizen Oct 29 '20 at 13:42
  • None of this is working for me with dplyr 1.0.9. While the documentation still reads the same as @tmfmnk comment, if nrow is less than the sample number I get an error that it cannot take a sample larger than the population when replace=FALSE. When trying the min() approach, I get an error that n has to be a constant. – Kevin May 18 '22 at 13:46
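
For setups where `slice_sample(n = x)` raises the error described in the last comment, one possible fallback is to draw the row positions by hand with `slice()`, which accepts `n()` in its arguments. The following is only a sketch on the question's example data; the seed and the name `capped` are arbitrary choices made here, and it has not been checked against every dplyr release.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

x <- 2         # maximum number of rows to keep per (samples, groups) pair
set.seed(42)   # arbitrary seed, only to make the random draw reproducible

capped <- df %>%
  group_by(samples, groups) %>%
  # within each group, draw min(n(), x) distinct row positions at random
  slice(sample.int(n(), min(n(), x))) %>%
  ungroup()

# Sanity check: no (samples, groups) pair should have more than x rows
count(capped, samples, groups)
```

Because the sample size is capped with `min(n(), x)` before `sample.int()` is called, groups with fewer than `x` rows are returned whole instead of triggering a "sample larger than the population" error.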

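One option with data.table (this assumes df has already been converted to a data.table, e.g. with setDT(df), and that X is defined, e.g. X <- 2):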

df[df[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1]

   samples groups
1:       A      1
2:       A      1
3:       A      2
4:       B      1
5:       B      1
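
The one-liner above relies on `df` being a data.table and on `X` existing. A self-contained version might look like the sketch below; the seed, the intermediate names `keep` and `result`, and the final `sort()` are additions made here for illustration, not part of the original answer.

```r
library(data.table)

# Example data from the question, built directly as a data.table
df <- data.table(
  samples = c("A", "A", "A", "A", "B", "B"),
  groups  = c(1, 1, 1, 2, 1, 1)
)

X <- 2         # maximum number of rows to keep per (samples, groups) pair
set.seed(42)   # arbitrary seed, only to make the random draw reproducible

# .I holds the row numbers in the full table and .N the group size;
# within each group, pick min(.N, X) of those row numbers at random
keep <- df[, .I[sample.int(.N, min(.N, X))], by = .(samples, groups)]$V1

# sort() only restores the original row order; it does not affect which rows are kept
result <- df[sort(keep)]

# Sanity check: group sizes after downsampling
result[, .N, by = .(samples, groups)]
```

Subsetting by the collected row numbers keeps any additional columns of `df` intact, which matters if, as the comment under the question suggests, the real data has more columns than the two grouping variables.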
s_baldur