I am trying to randomly sample 50% of the data in each group, following "Stratified random sampling from data frame". A reproducible example using the mtcars dataset in R is below. What I don't understand is this: the sample index clearly contains a group for gear labeled '5', but when the index is applied to mtcars, the sampled data mtcars2 does not contain any records with gear = 5. What went wrong? Thank you very much.

> set.seed(14908141)
> index=tapply(1:nrow(mtcars),mtcars$gear,function(x){sample(length(x),length(x)*0.5)})
> index
$`3`
[1]  6  7 14  4 12  9 13

$`4`
[1] 12  7  8  4  6  5

$`5`
[1] 5 1

> mtcars2=mtcars[unlist(index),]
> table(mtcars2$gear)

 3  4 
12  3 
user11806155

1 Answer

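I think the problem is that sample(length(x), ...) draws from 1:length(x) within each gear group, so the values in index are positions inside a group, not row numbers of mtcars. That is why the same numbers repeat across groups: in your output above, 7 appears in both gear groups 3 and 4. When you subset mtcars with these values, you only ever pick from the first 15 rows (the largest group has 15 cars), and none of those rows has gear 5.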
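If you want to keep the tapply approach, a minimal sketch of a fix is to index into x (the actual row numbers of the group) instead of sampling from 1:length(x). The use of seq_len, sample.int and floor here is my own choice, not something from the question; floor just makes explicit the rounding down that a fractional size otherwise gets implicitly.

```r
set.seed(14908141)

# x holds the row numbers of mtcars that belong to one gear group.
# Draw positions within the group, then map them back to row numbers via x[...].
index <- tapply(seq_len(nrow(mtcars)), mtcars$gear, function(x) {
  x[sample.int(length(x), floor(length(x) * 0.5))]
})

mtcars2 <- mtcars[unlist(index), ]

# Each gear level is now represented: 7, 6 and 2 rows for gear 3, 4 and 5
table(mtcars2$gear)
```

Indexing with x[sample.int(...)] rather than calling sample(x, ...) also avoids the case where a group has a single row and sample() would treat that one number as the upper bound of a range.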

Base R

I would use split first to split by gear:

res <- split(mtcars, mtcars$gear)

Then I loop over this list with lapply and sample 50% of the rows in each group:

res2 <- lapply(res, function(x) {
  x[sample(1:nrow(x), nrow(x)*0.5, FALSE), ]
})

If you would like a single data frame at the end (instead of a list), you can combine the pieces using do.call:

final_df <- do.call(rbind, res2)
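To check that every gear group really ended up in the result, something like the following sketch could be used. It repeats the three steps above so it runs on its own; the seed value is arbitrary and only there for reproducibility, and floor is added to make the rounding down explicit.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

res <- split(mtcars, mtcars$gear)
res2 <- lapply(res, function(x) {
  x[sample(seq_len(nrow(x)), floor(nrow(x) * 0.5)), ]
})
final_df <- do.call(rbind, res2)

# Rows sampled per gear group next to the original group sizes
rbind(sampled = table(final_df$gear), original = table(mtcars$gear))

# rbind() on a named list prefixes each row name with the group name
head(rownames(final_df), 3)
```

With group sizes of 15, 12 and 5, half of each group rounded down gives 7, 6 and 2 rows.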

dplyr

A simpler approach would be:

library(dplyr)
mtcars %>% 
  group_by(gear) %>% 
  sample_frac(0.5)
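In dplyr 1.0.0 and later, sample_frac is superseded by slice_sample, so an equivalent sketch (assuming a recent dplyr is installed) would be the following. The seed and the trailing ungroup and count calls are my additions for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

sampled <- mtcars %>%
  group_by(gear) %>%
  slice_sample(prop = 0.5) %>%  # proportion is applied per group, rounded down
  ungroup()

# Number of sampled rows in each gear group
count(sampled, gear)
```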
user63230