Random Sample 1 row for each unique column value in R

Question

I have a dataset that consists of two columns, idunique and match_no.

A reproducible example:

idunique <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
match_no <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)

df <- data.frame(idunique, match_no)

   idunique match_no
         1        1
         2        1
         3        1
         4        1
         5        2
         6        2
         7        3
         8        3
         9        4
        10        5

I need to randomly sample occurrences of match_no from the database and extract x unique occurrences.

Example output would be a random subset of idunique based on randomly sampled match_no:

  idunique match_no
        1        1
        5        2
        7        3
        9        4
       10        5

The real database is 6 million rows long with ~2000 duplicates of each match_no, so I need a solution that lets me change the sample size.

`df %>% group_by(match_no) %>% sample_n(1)` where 1 is the sample size. See https://dplyr.tidyverse.org/reference/sample.html — missuse, Apr 01 '21 at 16:22

score 2 · Answer 1 · answered Apr 01 '21 at 16:24

With data.table, we can do

library(data.table)
setDT(df)
# .I[sample(.N, 1)] avoids sample(.I, 1) treating a lone row number x as 1:x
df[df[, .I[sample(.N, 1)], by = match_no]$V1]
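
To draw more than one row per group, the same row-index idea can be extended. What follows is a minimal sketch rather than part of the original answer: `n_per_group` is a hypothetical parameter name, and `min(.N, n_per_group)` is an assumed policy of returning every row when a group is smaller than the requested size. Sampling positions with `sample(.N, ...)` and then indexing `.I` sidesteps the base R behaviour where `sample(x, 1)` on a single number `x` samples from `1:x`.

```r
library(data.table)

# Example data from the question
df <- data.frame(
  idunique = 1:10,
  match_no = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)
)
dt <- as.data.table(df)

n_per_group <- 2  # hypothetical parameter: rows to draw per match_no

set.seed(42)  # only so the sketch is reproducible

# For each match_no, pick up to n_per_group positions within the group,
# map them to global row numbers through .I, then subset once by row number.
rows <- dt[, .I[sample(.N, min(.N, n_per_group))], by = match_no]$V1
sampled <- dt[rows]
```

Collecting row numbers first and subsetting once tends to be cheaper on large tables than building a `.SD` subset per group, which is the point raised in the comments below.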

akrun

What about this: `setDT(df1)[, .SD[sample(x = .N, size = sample_size)], by = match_no][]` – M-- Apr 01 '21 at 16:31

@M-- It can also be done that way, but I think `.I` will be faster – akrun Apr 01 '21 at 16:33

score 2 · Accepted Answer · answered Apr 01 '21 at 16:24

library(dplyr)
df %>% group_by(match_no) %>% sample_n(1)
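
For an adjustable sample size, a sketch along the following lines may help. It uses `slice_sample()`, which dplyr (1.0.0 and later) documents as the successor to `sample_n()`. The names `n_groups` and `n_per_group` are hypothetical, and the first step (choosing a random subset of match_no values) is only one reading of "x unique occurrences" in the question; drop the `filter()` line if every match_no should be kept.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  idunique = 1:10,
  match_no = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)
)

n_groups    <- 3  # hypothetical: how many distinct match_no values to keep
n_per_group <- 1  # hypothetical: rows to draw within each kept match_no

set.seed(42)  # only so the sketch is reproducible

# Choose which match_no values to keep by sampling positions, not values,
# so a single remaining value is not misread by sample() as 1:x.
groups <- unique(df$match_no)
keep   <- groups[sample(length(groups), n_groups)]

sampled <- df %>%
  filter(match_no %in% keep) %>%     # restrict to the chosen groups
  group_by(match_no) %>%
  slice_sample(n = n_per_group) %>%  # draw rows within each group
  ungroup()
```

With `n_per_group <- 1` and the `filter()` step removed, this returns one random row per match_no, matching the example output in the question.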

Kilian Murphy
