I have a dataset that consists of two columns, idunique and match_no.

Here is a reproducible example:

idunique <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
match_no <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)

df <- data.frame(idunique, match_no)

   idunique match_no
         1        1
         2        1
         3        1
         4        1
         5        2
         6        2
         7        3
         8        3
         9        4
        10        5

I need to randomly sample rows from this data so that each sampled match_no occurs only once in the result, and I need to be able to choose how many such unique occurrences (x) are extracted.

Example output would be a random subset of idunique based on randomly sampled match_no:

  idunique match_no
        1        1
        5        2
        7        3
        9        4
       10        5

The real data has 6 million rows with roughly 2000 duplicates of each match_no, so the solution needs to let me change the sample size.

M--
Kilian Murphy
    `df %>% group_by(match_no) %>% sample_n(1)` where 1 is the sample size. See https://dplyr.tidyverse.org/reference/sample.html – missuse Apr 01 '21 at 16:22

2 Answers

With data.table, we can do

library(data.table)
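# .I[sample(.N, 1)] rather than sample(.I, 1): for a one-row group, sample(9, 1) would draw from 1:9
setDT(df)[df[, .I[sample(.N, 1)], by = match_no]$V1]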
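
The question also asks for an adjustable sample size. The following is a minimal sketch of one way to extend the same idea, not part of the original answer. It assumes "sample size" means the number of rows kept per match_no; `n_per_group` is a hypothetical name introduced here, and groups with fewer rows than that are kept whole.

```r
library(data.table)

# Same toy data as in the question
df <- data.frame(
  idunique = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  match_no = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)
)
setDT(df)

n_per_group <- 2  # hypothetical parameter: rows to keep per match_no

# .I holds the row numbers of the current group and .N its size.
# Indexing .I with sample(.N, k) avoids sample()'s special case for a
# single number, and min() caps k at the group size.
idx <- df[, .I[sample(.N, min(.N, n_per_group))], by = match_no]$V1
res <- df[idx]
```

Setting `n_per_group <- 1` gives one random row per match_no, as in the answer above.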
akrun

library(dplyr)
df %>% group_by(match_no) %>% sample_n(1)
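
If "x unique occurrences" is read as drawing x distinct match_no values and then one random row for each, a sketch along these lines may work. This reading is an assumption, as is the name `x`; `slice_sample()` is used in place of `sample_n()`, which the dplyr documentation marks as superseded.

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(
  idunique = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  match_no = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 5)
)

x <- 3  # hypothetical parameter: number of distinct match_no values to draw

# Draw x distinct match_no values; sample.int() on the positions avoids
# sample()'s special handling of a length-one numeric vector.
ids  <- unique(df$match_no)
keep <- ids[sample.int(length(ids), x)]

# Keep only the drawn groups, then take one random row from each
res <- df %>%
  filter(match_no %in% keep) %>%
  group_by(match_no) %>%
  slice_sample(n = 1) %>%
  ungroup()
```

Here `res` has x rows, each with a different match_no. `x` must not exceed the number of distinct match_no values, otherwise `sample.int()` raises an error.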

Kilian Murphy