48

I have a data frame of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). The data frame looks like:

        ID  Year    Temp    ph
1       P1  1996    11.3    6.80
2       P1  1996    9.7     6.90
3       P1  1997    9.8     7.10
...
2000    P2  1997    10.5    6.90
2001    P2  1997    9.9     7.00
2002    P2  1997    10.0    6.93

I want to take 500 random rows for every ID (so 500 for P1, 500 for P2, ...) and create a new df. I tried:

new_df<-df[df$ID %in% sample(unique(df$ID),500),]

But it takes randomly one ID, while I need 500 random rows for every ID.

camille
matteo
  • 3
    If you came here for the reverse question of using all rows but sampling from some of the 15 different IDs: https://stackoverflow.com/questions/37149649/randomly-sample-groups – Christopher Oezbek Feb 07 '21 at 19:57

9 Answers

92

This is available as the slice_sample function in dplyr:

library(dplyr)
new_df <- df %>% group_by(ID) %>% slice_sample(n=500)

In older versions of dplyr (before 1.0.0), the function was called sample_n, which has since been superseded by slice_sample.

drhagen
19

Try this:

library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
joran
13

Here is one approach in base R.

First, the prerequisite sample data to work with:

set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))
mydf
#    ID         X1          X2          X3
# 1   1 -0.6264538 -0.04493361  1.35867955
# 2   1  0.1836433 -0.01619026 -0.10278773
# 3   1 -0.8356286  0.94383621  0.38767161
# 4   1  1.5952808  0.82122120 -0.05380504
# 5   1  0.3295078  0.59390132 -1.37705956
# 6   2 -0.8204684  0.91897737 -0.41499456
# 7   2  0.4874291  0.78213630 -0.39428995
# 8   2  0.7383247  0.07456498 -0.05931340
# 9   2  0.5757814 -1.98935170  1.10002537
# 10  2 -0.3053884  0.61982575  0.76317575
# 11  3  1.5117812 -0.05612874 -0.16452360
# 12  3  0.3898432 -0.15579551 -0.25336168
# 13  3 -0.6212406 -1.47075238  0.69696338
# 14  3 -2.2146999 -0.47815006  0.55666320
# 15  3  1.1249309  0.41794156 -0.68875569

Second, the sampling:

do.call(rbind, 
        lapply(split(mydf, mydf$ID), 
               function(x) x[sample(nrow(x), 3), ]))
#      ID         X1          X2         X3
# 1.2   1  0.1836433 -0.01619026 -0.1027877
# 1.1   1 -0.6264538 -0.04493361  1.3586796
# 1.5   1  0.3295078  0.59390132 -1.3770596
# 2.10  2 -0.3053884  0.61982575  0.7631757
# 2.9   2  0.5757814 -1.98935170  1.1000254
# 2.8   2  0.7383247  0.07456498 -0.0593134
# 3.13  3 -0.6212406 -1.47075238  0.6969634
# 3.12  3  0.3898432 -0.15579551 -0.2533617
# 3.15  3  1.1249309  0.41794156 -0.6887557

There is also strata from the sampling package, which is convenient when you want to sample different sizes from each group:

# install.packages("sampling")
library(sampling)
set.seed(1)
x <- strata(mydf, "ID", size = c(2, 3, 2), method = "srswor")
getdata(mydf, x)
#            X1          X2         X3 ID ID_unit Prob Stratum
# 2   0.1836433 -0.01619026 -0.1027877  1       2  0.4       1
# 5   0.3295078  0.59390132 -1.3770596  1       5  0.4       1
# 6  -0.8204684  0.91897737 -0.4149946  2       6  0.6       2
# 8   0.7383247  0.07456498 -0.0593134  2       8  0.6       2
# 9   0.5757814 -1.98935170  1.1000254  2       9  0.6       2
# 14 -2.2146999 -0.47815006  0.5566632  3      14  0.4       3
# 15  1.1249309  0.41794156 -0.6887557  3      15  0.4       3
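
For comparison, different sample sizes per group can also be sketched in base R alone by pairing each group with its own size through Map. The `sizes` vector below is a hypothetical choice for illustration, and the `min()` cap is an added assumption so that a group smaller than its requested size does not cause an error:

```r
# Toy data: three groups of five rows each (same shape as mydf above)
set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))

# Hypothetical per-group sample sizes, named by ID value
sizes <- c("1" = 2, "2" = 3, "3" = 2)

groups <- split(mydf, mydf$ID)

# Draw sizes[id] rows from each group, capped at the group's row count
sampled <- Map(function(g, n) g[sample(nrow(g), min(n, nrow(g))), ],
               groups, sizes[names(groups)])

result <- do.call(rbind, sampled)
table(result$ID)  # number of sampled rows per ID
```

Unlike strata, this sketch does not return inclusion probabilities or stratum labels; it only returns the sampled rows.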
A5C1D2H2I1M1N2O1R2T1
13

In case you have big datasets, a data.table solution could go like this:

library(data.table)

# Generate 26 million rows of random data
set.seed(2023-08-11) # fix the random number generator (RNG) state for reproducibility; note the argument is evaluated as arithmetic, i.e. 2004
dt <- data.table(c1 = sample(length(LETTERS)*10^6), 
                 c2 = sample(LETTERS, replace = TRUE))

# For each letter, sample 500 rows
set.seed(2023-08-11) # anchor the RNG again, as we use `sample` again
dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]

# We indeed sampled 500 rows for each letter
dt_sample[, .N, by = c2][order(c2)]
#>     c2   N
#>  1:  A 500
#>  2:  D 500
#>  3:  G 500
#>  4:  I 500
#>  5:  M 500
#>  6:  N 500
#>  7:  O 500
#>  8:  P 500
#>  9:  Q 500
#> 10:  R 500
#> 11:  S 500
#> 12:  T 500
#> 13:  U 500
#> 14:  V 500
#> 15:  W 500
#> 16:  Y 500
#> 17:  Z 500

Created on 2019-04-23 by the reprex package (v0.2.1)

If your data is unbalanced, in the sense that some groups have fewer rows than your desired sample size, you need a defensive guard: use min(500, .N) as the sample size (see "sample random rows within each group in a data.table"). For example:

dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]

Valentin_Ștefan
  • This is great! Should I set the seed also before calling `dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]` to have it reproducible? – umbe1987 Aug 11 '23 at 08:43
  • 1
    @umbe1987, yes, it is safer/good practice to have the seed set again. Thanks for pointing that out. I updated the code. – Valentin_Ștefan Aug 11 '23 at 11:48
2

An approach for when one of the IDs has fewer than 500 rows. Here I used the mtcars data set, with n = 8 standing in for 500:

n <- 8
df <- mtcars
df$ID <- df$cyl

FUN <- function(x, n) {
    if (length(x) <= n) return(x)
    x[x %in% sample(x, n)]
}

df[unlist(lapply(split(1:nrow(df), df$ID), FUN, n = 8)), ]
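
As a quick check of the approach above, the sketch below reruns it and tabulates the rows per ID before and after sampling. The seed value and the name `out` are arbitrary choices for illustration, not part of the original answer:

```r
n <- 8
df <- mtcars
df$ID <- df$cyl

# Same helper as above: keep every index when the group has n or fewer rows,
# otherwise keep the n indices that were drawn
FUN <- function(x, n) {
    if (length(x) <= n) return(x)
    x[x %in% sample(x, n)]
}

set.seed(42)  # arbitrary seed, only so the rows drawn are reproducible
out <- df[unlist(lapply(split(seq_len(nrow(df)), df$ID), FUN, n = n)), ]

table(df$ID)   # rows per ID in the full data: 11, 7 and 14
table(out$ID)  # rows per ID after sampling: 8, 7 and 8
```

The group with 7 rows is returned whole, while the two larger groups are cut down to 8 rows each.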
Tyler Rinker
1

Here's an elegant solution based on data.table. You can randomly draw IDs from a panel data set (balanced or unbalanced) in three simple steps:

Step 1: Store unique IDs from your original data set in a vector (my data set is called "main" and the identifier is called "id"):

ids <- unique(main$id)

Step 2: Randomly draw IDs from the vector from step 1. In the example below, I randomly draw 50 IDs from the vector "ids" and store them in the new vector "draw":

draw <- sample(ids, 50)

Step 3: Subset rows in your original data set based on matches with the IDs drawn in step 2.

rsample <- main[main$id %in% draw, ]
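
Note that these three steps sample whole IDs and keep all of their rows, which differs from drawing a fixed number of rows within every ID. A minimal base R sketch of the difference, using a made-up toy panel (the column `value` and the sizes are assumptions for illustration):

```r
# Toy panel: 4 ids with 5 rows each ("main" and "id" follow the names above)
set.seed(1)
main <- data.frame(id = rep(c("a", "b", "c", "d"), each = 5),
                   value = rnorm(20))

# Sampling ids: keeps every row of the 2 ids that were drawn
draw <- sample(unique(main$id), 2)
by_id <- main[main$id %in% draw, ]

# Sampling rows within ids: keeps 2 rows from each of the 4 ids
by_row <- do.call(rbind,
                  lapply(split(main, main$id),
                         function(x) x[sample(nrow(x), 2), ]))

table(by_id$id)   # 2 ids, 5 rows each
table(by_row$id)  # 4 ids, 2 rows each
```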
0
mydata1 is your original data (not tested):

# Split the data into one data frame per ID
mydata2 <- split(mydata1, mydata1$ID)
names(mydata2) <- paste0("mydata2", seq_along(mydata2))
# Draw 500 rows without replacement from each piece
mysample <- lapply(mydata2, function(x) x[sample(nrow(x), size = 500, replace = FALSE), ])

library(plyr) # for rbinding the mysample
ldply(mysample)
Metrics
0

This is not a very elegant solution, but it may work.

library(data.table)
df1 <- data.table(df)
f <- list()
# Sample 500 rows for each ID and collect the pieces in a list
for (i in unique(df1$ID)) {
  f[[as.character(i)]] <- df1[ID == i][sample(.N, 500)]
}
dfnew <- rbindlist(f)
Varn K
0
library(data.table) #1
df <- data.table(df) #2
df[,group_num := sample(2,.N,replace = TRUE,prob = c(500,.N-500)/.N),by = "ID"] #3
df_sample = df[group_num == 1,] #4

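Note that line #3 assigns each row to group 1 with probability 500/.N, so this keeps roughly, not exactly, 500 rows per ID. To get exactly 500, you can change lines #3 and #4 to: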

df[,random_num := sample(.N,.N),by="ID"]
df_sample  = df[random_num <=500,]
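
The random-rank idea can also be sketched in base R with ave, which applies a function within each group and returns the results in the original row order. The toy data, the column `Temp` and the value of `n` below are stand-ins for illustration:

```r
set.seed(1)
df <- data.frame(ID = rep(c("P1", "P2", "P3"), times = c(10, 4, 7)),
                 Temp = rnorm(21))
n <- 5  # stand-in for 500

# Random rank from 1 to the group size within each ID
random_num <- ave(seq_len(nrow(df)), df$ID,
                  FUN = function(i) sample(length(i)))

# Keep the rows whose rank is at most n; smaller groups are kept whole
df_sample <- df[random_num <= n, ]
table(df_sample$ID)  # rows kept per ID: 5, 4 and 5
```

Because the ranks within a group are a permutation of 1 to the group size, a group with fewer than n rows is returned in full without any extra guard.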
cloudscomputes