Stratified random sampling from data frame_follow up

Question

I am trying to randomly sample 50% of the data in each group, following "Stratified random sampling from data frame". A reproducible example using the mtcars dataset in R is shown below. What I don't understand is that the sample index clearly contains a group for gear labeled '5', but when the index is applied to the mtcars dataset, the sampled data mtcars2 does not contain any records with gear = 5. What went wrong? Thank you very much.

> set.seed(14908141)
> index=tapply(1:nrow(mtcars),mtcars$gear,function(x){sample(length(x),length(x)*0.5)})
> index
$`3`
[1]  6  7 14  4 12  9 13

$`4`
[1] 12  7  8  4  6  5

$`5`
[1] 5 1

> mtcars2=mtcars[unlist(index),]
> table(mtcars2$gear)

 3  4 
12  3

user63230 · Accepted Answer · 2020-06-24T14:36:06.320

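I think your approach samples positions 1:length(x) within each gear group rather than the actual row numbers of mtcars, so the same number can turn up in more than one group. When you then subset mtcars with these numbers, they are treated as row numbers of the whole data set, which is why it isn't working: in your output above, 7 appears in both gear groups 3 and 4, and no index is larger than 14, although the gear 5 cars sit in rows 27 to 31.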
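
To make the diagnosis concrete, here is a minimal sketch that compares the real row numbers of each gear group with what sample(length(x), ...) can return. The names rows_by_gear and pos and the seed are only illustrative and are not from the question.

```r
# Actual row numbers of mtcars that belong to each gear group
rows_by_gear <- split(seq_len(nrow(mtcars)), mtcars$gear)

set.seed(1)  # arbitrary seed, assumed here only for reproducibility

# sample(length(x), k) draws from 1:length(x), i.e. positions within
# the group, not row numbers of mtcars
pos <- lapply(rows_by_gear, function(x) {
  sample(length(x), floor(length(x) * 0.5))
})

# No position can exceed the size of the largest group ...
max(unlist(pos))

# ... while the gear 5 cars are located further down in the data
rows_by_gear[["5"]]
```

Used as row numbers of the full data, such positions can only reach the first rows of mtcars, none of which has gear 5.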

Base R

I would use split first to split by gear:

res <- split(mtcars, mtcars$gear)

Then I would run over this list with lapply and sample 50% of the rows in each group:

res2 <- lapply(res, function(x) {
  # draw half of the rows (rounded down) without replacement
  x[sample(nrow(x), floor(nrow(x) * 0.5)), ]
})

If you would like one data frame at the end (instead of a list), you can combine the pieces using do.call:

final_df <- do.call(rbind, res2)
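
If you would rather keep the tapply() approach from the question, a minimal sketch of one possible fix is to index into x (the actual row numbers of the group) instead of sampling from 1:length(x). The use of sample.int() and floor() is my own choice here, not something from the question.

```r
set.seed(14908141)  # seed reused from the question

# x holds the actual row numbers of one gear group, so index into x
# rather than sampling from 1:length(x)
index <- tapply(seq_len(nrow(mtcars)), mtcars$gear, function(x) {
  x[sample.int(length(x), floor(length(x) * 0.5))]
})

mtcars2 <- mtcars[unlist(index), ]
table(mtcars2$gear)
```

Indexing with x[sample.int(...)] rather than calling sample(x, ...) directly also avoids the surprise that sample() treats a single number n as 1:n, which could matter if a group had only one row.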

dplyr

A simpler approach would be:

library(dplyr)
mtcars %>% 
  group_by(gear) %>% 
  sample_frac(., 0.5)
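
In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(). A minimal sketch of the same idea, assuming a recent dplyr is installed (the name sampled is only illustrative):

```r
library(dplyr)

set.seed(14908141)  # seed reused from the question

sampled <- mtcars %>%
  group_by(gear) %>%
  slice_sample(prop = 0.5) %>%  # half of each gear group, rounded down
  ungroup()

# number of sampled rows per gear group
count(sampled, gear)
```

Note that the rows come back as a tibble, so the car names stored as row names of mtcars are dropped.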

edited Jun 24 '20 at 14:36

answered Jun 24 '20 at 13:34

user63230


Thank you very much for the quick response. I noticed the problem that the same record was included in more than one group. Is it possible to correct my code above with tapply() or some other base R function? – user11806155 Jun 24 '20 at 14:08

See the base R solution above. – user63230 Jun 24 '20 at 14:36
do.call(), that's exactly what I need, thank you very much. – user11806155 Jun 24 '20 at 14:58
