Take random sample by group

Question

I have a data frame made up of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). The data frame looks like:

        ID  Year    Temp    ph
1       P1  1996    11.3    6.80
2       P1  1996    9.7     6.90
3       P1  1997    9.8     7.10
...
2000    P2  1997    10.5    6.90
2001    P2  1997    9.9     7.00
2002    P2  1997    10.0    6.93

I want to take 500 random rows for every ID (so 500 for P1, 500 for P2, ...) and create a new df. I tried:

new_df <- df[df$ID %in% sample(unique(df$ID), 500), ]

But it randomly takes one ID, while I need 500 random rows for every ID.

If you came here for the reverse question of using all rows but sampling from some of the 15 different IDs: https://stackoverflow.com/questions/37149649/randomly-sample-groups — Christopher Oezbek, Feb 07 '21 at 19:57

drhagen · Answer 1 · 2022-01-01T14:36:49.093

92

This is available as the slice_sample function in dplyr:

library(dplyr)
new_df <- df %>% group_by(ID) %>% slice_sample(n=500)

In older versions of dplyr, the function was called sample_n, which has since been superseded by slice_sample.
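
A minimal sketch of how slice_sample behaves when a group has fewer rows than requested. The toy data frame and its names (`toy`, `toy_sample`) are assumptions for illustration, not the asker's data. With the default replace = FALSE, slice_sample (dplyr 1.0.0 or later) silently truncates to the group size instead of throwing an error:

```r
library(dplyr)

set.seed(42)
# Hypothetical toy data: P1 has 10 rows, P2 has only 3
toy <- data.frame(ID = rep(c("P1", "P2"), times = c(10, 3)),
                  Temp = round(rnorm(13, mean = 10), 1))

# Ask for 5 rows per ID; P2 is truncated to its 3 available rows
toy_sample <- toy %>%
  group_by(ID) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Number of sampled rows per ID
count(toy_sample, ID)
```

To sample a fraction of each group rather than a fixed count, slice_sample also accepts prop, for example slice_sample(prop = 0.1).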

edited Jan 01 '22 at 14:36

answered Aug 30 '16 at 18:42

Worked well on large data frame. – Martin Thøgersen Nov 10 '17 at 09:22
Did not work when called from a for loop within a function. Works perfectly outside the function. Anyone has a hint why? – Marina May 22 '18 at 21:44
Non-Standard Evaluation/Standard Evaluation issues: https://stackoverflow.com/a/34187076/5088194 – leerssej Jan 09 '19 at 20:34
The only issue I had with this solution is that you can only take as many samples as the smallest group has. Say one ID has 499 rows, but you need 500 for all others, it will throw an error. – HaplessEcologist Apr 02 '21 at 19:25
Just an FYI since dplyr verbs change a lot: in dplyr v.1 this is superseded by `slice_sample` – camille Dec 24 '21 at 15:45

score 19 · Accepted Answer · answered Aug 15 '13 at 18:11

Try this:

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])

joran

A5C1D2H2I1M1N2O1R2T1 · Answer 3 · 2013-08-15T18:20:08.787

Here is one approach in base R.

First, the prerequisite sample data to work with:

set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))
mydf
#    ID         X1          X2          X3
# 1   1 -0.6264538 -0.04493361  1.35867955
# 2   1  0.1836433 -0.01619026 -0.10278773
# 3   1 -0.8356286  0.94383621  0.38767161
# 4   1  1.5952808  0.82122120 -0.05380504
# 5   1  0.3295078  0.59390132 -1.37705956
# 6   2 -0.8204684  0.91897737 -0.41499456
# 7   2  0.4874291  0.78213630 -0.39428995
# 8   2  0.7383247  0.07456498 -0.05931340
# 9   2  0.5757814 -1.98935170  1.10002537
# 10  2 -0.3053884  0.61982575  0.76317575
# 11  3  1.5117812 -0.05612874 -0.16452360
# 12  3  0.3898432 -0.15579551 -0.25336168
# 13  3 -0.6212406 -1.47075238  0.69696338
# 14  3 -2.2146999 -0.47815006  0.55666320
# 15  3  1.1249309  0.41794156 -0.68875569

Second, the sampling:

do.call(rbind, 
        lapply(split(mydf, mydf$ID), 
               function(x) x[sample(nrow(x), 3), ]))
#      ID         X1          X2         X3
# 1.2   1  0.1836433 -0.01619026 -0.1027877
# 1.1   1 -0.6264538 -0.04493361  1.3586796
# 1.5   1  0.3295078  0.59390132 -1.3770596
# 2.10  2 -0.3053884  0.61982575  0.7631757
# 2.9   2  0.5757814 -1.98935170  1.1000254
# 2.8   2  0.7383247  0.07456498 -0.0593134
# 3.13  3 -0.6212406 -1.47075238  0.6969634
# 3.12  3  0.3898432 -0.15579551 -0.2533617
# 3.15  3  1.1249309  0.41794156 -0.6887557
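
The same split-and-sample idea can be wrapped in a small helper that also copes with groups smaller than the requested size. This is only a sketch: `sample_by_group` and the toy data are hypothetical names introduced here, not part of base R or of the answer above.

```r
# Hypothetical helper: sample up to n rows from each group of a data frame
sample_by_group <- function(data, group_col, n) {
  # Split the row indices (not the rows themselves) by group
  idx_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  picked <- lapply(idx_by_group, function(idx) {
    k <- min(n, length(idx))  # never ask for more rows than the group has
    # Sample positions within idx; this avoids sample()'s special case
    # where a single number x is treated as 1:x
    idx[sample.int(length(idx), k)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
# Toy data: groups 1 and 2 have 5 rows, group 3 has only 2
toy <- data.frame(ID = rep(1:3, times = c(5, 5, 2)), X1 = rnorm(12))
out <- sample_by_group(toy, "ID", 3)

# Rows sampled per group: 3, 3 and 2
table(out$ID)
```

Because the result is built by indexing the original data frame, the original row names are kept, which makes it easy to see which rows were drawn.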

There is also strata from the sampling package, which is convenient when you want to sample different sizes from each group:

# install.packages("sampling")
library(sampling)
set.seed(1)
x <- strata(mydf, "ID", size = c(2, 3, 2), method = "srswor")
getdata(mydf, x)
#            X1          X2         X3 ID ID_unit Prob Stratum
# 2   0.1836433 -0.01619026 -0.1027877  1       2  0.4       1
# 5   0.3295078  0.59390132 -1.3770596  1       5  0.4       1
# 6  -0.8204684  0.91897737 -0.4149946  2       6  0.6       2
# 8   0.7383247  0.07456498 -0.0593134  2       8  0.6       2
# 9   0.5757814 -1.98935170  1.1000254  2       9  0.6       2
# 14 -2.2146999 -0.47815006  0.5566632  3      14  0.4       3
# 15  1.1249309  0.41794156 -0.6887557  3      15  0.4       3

Valentin_Ștefan · Answer 4 · 2023-08-11T11:51:05.893

In case you have big datasets, a data.table solution could go like this:

library(data.table)

# Generate 26 mil rows random data
set.seed(2023-08-11) # fix the random number generator (RNG) state for reproducibility; note the argument is arithmetic (2023 - 8 - 11 = 2004)
dt <- data.table(c1 = sample(length(LETTERS)*10^6), 
                 c2 = sample(LETTERS, replace = TRUE))

# For each letter, sample 500 rows
set.seed(2023-08-11) # anchor the RNG again, as we use `sample` again
dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]

# We indeed sampled 500 rows for each letter
dt_sample[, .N, by = c2][order(c2)]
#>     c2   N
#>  1:  A 500
#>  2:  D 500
#>  3:  G 500
#>  4:  I 500
#>  5:  M 500
#>  6:  N 500
#>  7:  O 500
#>  8:  P 500
#>  9:  Q 500
#> 10:  R 500
#> 11:  S 500
#> 12:  T 500
#> 13:  U 500
#> 14:  V 500
#> 15:  W 500
#> 16:  Y 500
#> 17:  Z 500

Created on 2019-04-23 by the reprex package (v0.2.1)

If your data is unbalanced, in the sense that some groups have fewer rows than your desired sample size, cap the sample size defensively with min(500, .N) (see "sample random rows within each group in a data.table"). For example:

dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
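
A variant that is sometimes used on large tables collects the sampled row numbers with .I and subsets once, instead of building .SD for every group. The sketch below is run on a hypothetical toy table (`dt_toy` and its columns are assumptions for illustration); whether it is actually faster on your data is worth measuring.

```r
library(data.table)

set.seed(1)
# Hypothetical toy table: group "a" has 1000 rows, group "b" only 20
dt_toy <- data.table(c1 = 1:1020,
                     c2 = rep(c("a", "b"), times = c(1000, 20)))

# .I holds the row numbers of the current group; sample positions within it,
# capped at the group size. The unnamed result column is called V1.
rows <- dt_toy[, .I[sample(.N, min(500, .N))], by = c2]$V1

# Subset the original table once with the collected row numbers
dt_toy_sample <- dt_toy[rows]

# Rows sampled per group
dt_toy_sample[, .N, by = c2]
```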

This is great! Should I set the seed also before calling `dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]` to have it reproducible? — umbe1987, Aug 11 '23 at 08:43
@umbe1987, yes, it is safer/good practice to have the seed set again. Thanks for pointing that out. I updated the code. — Valentin_Ștefan, Aug 11 '23 at 11:48

score 2 · Answer 5 · answered Aug 15 '13 at 18:13

An approach for when one of the IDs has fewer than 500 rows. Here I used the mtcars data set:

n <- 8
df <- mtcars
df$ID <- df$cyl

FUN <- function(x, n) {
    if (length(x) <= n) return(x)
    x[x %in% sample(x, n)]
}

df[unlist(lapply(split(1:nrow(df), df$ID), FUN, n = n)), ]

Tyler Rinker

score 1 · Answer 6 · answered Nov 11 '21 at 09:44

Here's an elegant solution based on data.table. You can randomly draw IDs from a panel data set (balanced or unbalanced) in three simple steps. Note that this keeps all rows of the drawn IDs rather than sampling rows within each ID:

Step 1: Store unique IDs from your original data set in a vector (my data set is called "main" and the identifier is called "id"):

ids <- unique(main$id)

Step 2: Randomly draw IDs from the vector from step 1. In the example below, I randomly draw 50 IDs from the vector "ids" and store them in the new vector "draw":

draw <- sample(ids, 50)  # base R equivalent of ids %>% sample(50); no pipe package needed

Step 3: Subset rows in your original data set based on matches with the IDs drawn in step 2.

rsample <- main[main$id %in% draw, ]
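
The three steps can be tried end to end on a small made-up panel. The data below is hypothetical, and the sketch uses base R's sample() directly so that it runs without loading any package:

```r
set.seed(1)
# Hypothetical panel data: 6 ids with 4 rows each
main <- data.frame(id = rep(paste0("P", 1:6), each = 4),
                   value = rnorm(24))

ids <- unique(main$id)                 # Step 1: unique ids
draw <- sample(ids, 2)                 # Step 2: randomly draw 2 ids
rsample <- main[main$id %in% draw, ]   # Step 3: keep every row of the drawn ids

# Each drawn id contributes all 4 of its rows
table(rsample$id)
```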

score 0 · Answer 7 · answered Aug 15 '13 at 18:14

mydata1 is your original data (not tested):

mydata2 <- split(mydata1, mydata1$ID)
names(mydata2) <- paste0("mydata2", seq_along(mydata2))
# sample 500 rows without replacement from each group
mysample <- Map(function(x) x[sample(nrow(x), size = 500, replace = FALSE), ], mydata2)

library(plyr)  # for row-binding the list of samples
ldply(mysample)

score 0 · Answer 8 · answered Oct 24 '18 at 18:07

This is not a very elegant solution, but it may work.

library(data.table)
df <- as.data.table(df)
f <- list()
for (i in unique(df$ID)) {
  # keep the rows of this ID, then sample 500 of them
  f[[as.character(i)]] <- df[ID == i][sample(.N, 500)]
}
dfnew <- rbindlist(f)

Varn K

cloudscomputes · Answer 9 · 2019-04-18T04:15:43.740

0

library(data.table) #1
df <- data.table(df) #2
df[,group_num := sample(2,.N,replace = TRUE,prob = c(500,.N-500)/.N),by = "ID"] #3
df_sample = df[group_num == 1,] #4

or you can change lines #3 and #4 to:

df[,random_num := sample(.N,.N),by="ID"]
df_sample  = df[random_num <=500,]
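
The two variants are not equivalent. The first keeps each row with probability 500/.N, so the number of rows per ID varies around 500. The second ranks rows by a random permutation and keeps exactly 500. A sketch on a hypothetical toy table (`toy` and its columns are assumptions for illustration):

```r
library(data.table)

set.seed(1)
# Hypothetical toy table: 3 IDs with 2000 rows each
toy <- data.table(ID = rep(c("P1", "P2", "P3"), each = 2000), x = rnorm(6000))

# Variant 1: label each row 1 with probability 500/.N, otherwise 2
toy[, group_num := sample(2, .N, replace = TRUE, prob = c(500, .N - 500) / .N), by = "ID"]
approx_counts <- toy[group_num == 1, .N, by = "ID"]

# Variant 2: random permutation of 1..N within each ID, keep ranks up to 500
toy[, random_num := sample(.N, .N), by = "ID"]
exact_counts <- toy[random_num <= 500, .N, by = "ID"]

approx_counts  # close to 500 per ID, but usually not exactly 500
exact_counts   # exactly 500 per ID
```

Both variants assume every ID has at least 500 rows. With a smaller group, the first produces a negative probability and fails, while the second simply keeps all rows of that group.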

edited Apr 18 '19 at 04:15

answered Apr 18 '19 at 04:04
