Randomly sample data frame into 3 groups in R

Question

Objective: Randomly divide a data frame into 3 samples.

one sample with 60% of the rows
two other samples with 20% of the rows each
no row should appear in more than one sample (i.e. sample without replacement).

Here's a clunky solution:

allrows <- 1:nrow(mtcars)

set.seed(7)
trainrows <- sample(allrows, replace = F, size = 0.6*length(allrows))
test_cvrows <- allrows[-trainrows]
testrows <- sample(test_cvrows, replace=F, size = 0.5*length(test_cvrows))
cvrows <- test_cvrows[-which(test_cvrows %in% testrows)]

train <- mtcars[trainrows,]
test <- mtcars[testrows,]
cvr <- mtcars[cvrows,]

There must be something easier, perhaps in a package. dplyr has the sample_frac function, but that seems to target a single sample, not a split into multiple.

Close, but not quite the answer to this question: Random Sample with multiple probabilities in R

This may be of interest: http://stackoverflow.com/a/3911586/1191259 — Frank, Dec 01 '15 at 20:04
@Frank that is a pretty slick answer. I'd rather not have to hard code the partition rows though, I'm going to keep it in mind for future use. — Minnow, Dec 01 '15 at 20:17

Ben Bolker · Accepted Answer · 2015-12-01T21:18:40.437

16

Do you need the partitioning to be exact? If not,

set.seed(7)
ss <- sample(1:3,size=nrow(mtcars),replace=TRUE,prob=c(0.6,0.2,0.2))
train <- mtcars[ss==1,]
test <- mtcars[ss==2,]
cvr <- mtcars[ss==3,]

should do it.

Or, as @Frank says in comments, you can split() the original data to keep them as elements of a list:

mycars <- setNames(split(mtcars,ss), c("train","test","cvr"))
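To see what "not exact" means here, a small sketch (same seed and proportions as above; `sizes` is only an illustrative name) that tabulates the resulting group sizes. Each row is assigned to a group independently, so the counts only approximate 60/20/20 and will change with the seed.

```r
set.seed(7)
ss <- sample(1:3, size = nrow(mtcars), replace = TRUE, prob = c(0.6, 0.2, 0.2))

# Count rows per group; factor() keeps a group in the table even if it got 0 rows
sizes <- table(factor(ss, levels = 1:3))
print(sizes)

# Observed proportions, to compare against the requested 0.6 / 0.2 / 0.2
print(round(prop.table(sizes), 2))
```

If the observed proportions are too far off for your purposes, an exact partition is probably the better choice.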

edited Dec 01 '15 at 21:18

answered Dec 01 '15 at 19:49

Ben Bolker


I'd use `split` to not lose track of 'em `mycars = setNames(split(mtcars,ss), c("train","test","cvr"))` – Frank Dec 01 '15 at 19:53
Partitioning does not need to be exact. This works and I like @Frank's addendum. – Minnow Dec 01 '15 at 20:14
If you want an exact partitioning, replace the `ss` line with: `ss <- sample(rep(1:3, diff(floor(nrow(yourdataset) * c(0, 0.6, 0.8, 1)))))` – liori Sep 25 '17 at 23:46

score 1 · Answer 2 · answered Dec 01 '15 at 19:53

Not the prettiest solution (especially for larger samples), but it works.

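n = nrow(mtcars)
# use different rounding for different sizes/proportions
# (floor() makes explicit that fractional counts are rounded down)
times = rep(1:3, floor(c(0.6*n, 0.2*n, 0.2*n)))
ntimes = length(times)
# rounding down can leave a few rows without a label; draw their groups at random
if (ntimes < n)
    times = c(times, sample(1:3, n-ntimes, prob=c(0.6,0.2,0.2), replace=FALSE))
# shuffle the labels so that group membership is random
sets = sample(times)
df1 = mtcars[sets==1,]
df2 = mtcars[sets==2,]
df3 = mtcars[sets==3,]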

answered Dec 01 '15 at 19:53

Max Candocia


A simpler `times` might be `findInterval(1:n, n*c(0,.6,.8))` – Frank Dec 01 '15 at 19:58
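A minimal sketch of how that `findInterval()` suggestion could slot into the answer above, assuming the same 60/20/20 proportions. The labels are built in fixed-size blocks and then shuffled, so no top-up step is needed.

```r
set.seed(7)
n <- nrow(mtcars)

# Fixed labels: 1 below 60% of n, 2 below 80% of n, 3 for the rest
times <- findInterval(seq_len(n), n * c(0, 0.6, 0.8))

# Shuffle the labels so group membership is random but group sizes are fixed
sets <- sample(times)

df1 <- mtcars[sets == 1, ]
df2 <- mtcars[sets == 2, ]
df3 <- mtcars[sets == 3, ]
```

One thing to watch: intervals are closed on the left, so when `n * 0.6` happens to be a whole number the boundary row falls into the next group (for n = 10 the sizes come out as 5/2/3 rather than 6/2/2).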

score 0 · Answer 3 · answered Dec 01 '15 at 21:33

Options without replacement

Using caret package.

library(caret)

inTrain <- createDataPartition(mtcars$mpg, p = 0.6, list = FALSE)
train <- mtcars[inTrain, ]
inTest <- createDataPartition(mtcars$mpg[-inTrain], list = FALSE)
test <- mtcars[-inTrain,][inTest, ]
cvr <- mtcars[-inTrain,][-inTest, ]

Base package.

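## splitData
# y: column of the data to create the split on (only its length is used)
# p: vector of proportions, one per split; any rows left over form a final split
splitData <- function(y, p = c(0.5)){
  if(sum(p) > 1){
    stop("sum of p cannot exceed 1")
  }

  rows <- seq_along(y)

  res <- list()

  n_sample <- round(length(rows) * p)
  for(size in n_sample){
    # draw positions among the rows that are still unassigned
    inSplit <- sample.int(length(rows), size)
    res <- c(res, list(rows[inSplit]))
    # setdiff() is safe when inSplit is empty; rows[-inSplit] would drop every row
    rows <- setdiff(rows, rows[inSplit])
  }

  if(sum(p) < 1){
    res <- c(res, list(rows))
  }

  res
}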

split_example_2 <- splitData(mtcars$mpg, p = c(0.6, 0.2))
split_example_3 <- splitData(mtcars$mpg)
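The function above returns row indices rather than data frames, so the indices still have to be applied to the data. A compact, self-contained sketch of the same idea (shuffle the row numbers once, cut them into blocks, then subset); the names `shuffled`, `sizes`, `groups`, `idx` and `parts` are illustrative and not part of the answer's code.

```r
set.seed(7)
n <- nrow(mtcars)

# Shuffle the row numbers once; every row appears exactly once
shuffled <- sample.int(n)

# Sizes for the first two groups; the last group takes whatever is left
sizes <- round(n * c(0.6, 0.2))
sizes <- c(sizes, n - sum(sizes))

# Label the shuffled rows block by block and collect indices per label
groups <- rep(c("train", "test", "cvr"), times = sizes)
idx <- split(shuffled, groups)

# Turn each index vector into a data frame
parts <- lapply(idx, function(i) mtcars[i, ])
```

`parts$train`, `parts$test` and `parts$cvr` are then ordinary data frames with no rows in common.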

score 0 · Answer 4 · answered May 11 '19 at 20:13

If you want to get exact and reproducible numbers for each group (split as close to the proportions as you can achieve, bearing in mind the group sizes must be whole numbers), rather than allow the group sizes to vary randomly each time you perform your random split, try:

sample_size <- nrow(mtcars)
set_proportions <- c(Training = 0.6, Validation = 0.2, Test = 0.2)
set_frequencies <- diff(floor(sample_size * cumsum(c(0, set_proportions))))
mtcars$set <- sample(rep(names(set_proportions), times = set_frequencies))

Then you can split into a list of dataframes simply by

mtcars <- split(mtcars, mtcars$set)

so, for example, the data frame for the validation set is now accessed as mtcars$Validation. Alternatively, instead of calling split() (which overwrites mtcars with a list), you can create separate data frames:

mtcars_train <- mtcars[mtcars$set == "Training", ]
mtcars_validation <- mtcars[mtcars$set == "Validation", ]
mtcars_test <- mtcars[mtcars$set == "Test", ]

In some cases, like this one, you can't split the data into exactly 60%, 20% and 20%, but this method ensures that the sizes of the two 20% sets differ by at most one:

> set_frequencies
  Training Validation       Test 
        19          6          7

Check it has worked as expected:

> table(mtcars$set)

      Test   Training Validation 
         7         19          6

(Based on the answer by Ben Bolker and the comment by liori.)
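If you need this more than once, the steps can be wrapped in a small function. This is only a sketch: `assign_sets` is a hypothetical helper name, and the `round(..., 8)` guard against floating-point drift in the cut points is an addition that is not in the code above.

```r
# Hypothetical helper: returns a shuffled vector of n set labels
assign_sets <- function(n, proportions) {
  stopifnot(abs(sum(proportions) - 1) < 1e-8)
  # Cumulative cut points; rounding first guards against a value such as
  # n - 1e-12, which floor() would otherwise turn into n - 1
  cuts <- floor(round(n * cumsum(c(0, proportions)), 8))
  # One label per row, in blocks of the computed sizes, then shuffled
  sample(rep(names(proportions), times = diff(cuts)))
}

set.seed(7)
cars <- mtcars  # work on a copy so mtcars itself is left untouched
cars$set <- assign_sets(nrow(cars), c(Training = 0.6, Validation = 0.2, Test = 0.2))
parts <- split(cars, cars$set)
sapply(parts, nrow)
```

The group sizes depend only on the number of rows and the proportions; the seed only changes which rows land in which group.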
