I'd like to know how to split data into more than two random samples in R. For example, to divide a data set 60:40 into training and validation data, I can write code like this:

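original.data = read.csv("~.csv", na.strings="")
# draw 60% of the row indices at random for the training set
n = nrow(original.data)
train.index = sample(seq_len(n), floor(n * 0.6))

# the remaining 40% of the rows form the validation set
train.data = original.data[train.index,]
valid.data = original.data[-train.index,]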

However, I can't figure out how to extend this to more than two sets, for example a 60:20:20 split.

I would appreciate it if you could show me the best solution!

ulfelder
1Sun
Look here: https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function. – akhetos May 23 '19 at 14:27
The only answer in that question that addresses more than two tests (suggested by "60:20:20") is [deleted](https://stackoverflow.com/a/39650418/) (and therefore not visible to most users), so unless OP is intended to *infer* the next step (use of `prob=`, used by none of the still-present answers), this is **not a duplicate** of 17200114. – r2evans May 23 '19 at 15:06

1 Answer

If you want more than two sets, then the other solutions are close but you need just a little more. There are at least two options.

First:

set.seed(2)
table(samp <- sample(1:3, size = nrow(iris), prob = c(0.6, 0.2, 0.2), replace = TRUE))
#  1  2  3 
# 93 35 22 

nrow(iris) # 150
set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]

nrow(set1)
# [1] 93
nrow(set2)
# [1] 35
nrow(set3)
# [1] 22

Because the assignment is random, you won't always get your exact proportions.

Second:

If you must have exact proportions, you can do this:

ns <- nrow(iris) * c(0.6, 0.2, 0.2)
sum(ns)
# [1] 150
### if rounding leaves sum(ns) != nrow(iris), adjust one element of ns so they match

rep(1:3, times = ns)
#   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  [46] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  [91] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [136] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
set.seed(2)
head(samp <- sample(rep(1:3, times = ns)))
# [1] 1 2 1 1 3 3

set1 <- iris[samp == 1,]
set2 <- iris[samp == 2,]
set3 <- iris[samp == 3,]
nrow(set1)
# [1] 90
nrow(set2)
# [1] 30
nrow(set3)
# [1] 30

This can easily be generalized to support an arbitrary number of partitions.
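
As a rough sketch of that generalization, the second approach can be wrapped in a small helper. The name `split_exact` and its arguments are hypothetical (they are not part of base R or of the answer above), and giving the rows lost to rounding to the first partition is just one possible way to "fix one of `ns`":

```r
# Hypothetical helper: split a data frame into randomly assigned partitions
# whose sizes match the requested proportions as exactly as rounding allows.
split_exact <- function(data, props, seed = NULL) {
  stopifnot(length(props) >= 1, all(props >= 0), sum(props) > 0)
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(data)
  # Target sizes, rounded down; dividing by sum(props) also accepts c(60, 20, 20).
  ns <- floor(n * props / sum(props))
  # Assumption: rows lost to rounding go to the first partition, so sum(ns) == n.
  ns[1] <- ns[1] + (n - sum(ns))
  # Build one label per row, then shuffle the labels by permuting their positions.
  labels <- rep(seq_along(props), times = ns)
  labels <- labels[sample.int(length(labels))]
  # Return a list of data frames, one per partition (empty partitions are kept).
  split(data, factor(labels, levels = seq_along(props)))
}

parts <- split_exact(iris, c(0.6, 0.2, 0.2), seed = 2)
sapply(parts, nrow)
```

For `iris` this gives partitions of 90, 30 and 30 rows, as in the second approach. Because `split()` returns a list, the number of partitions is determined only by the length of `props`, e.g. `split_exact(iris, c(0.5, 0.2, 0.2, 0.1))`.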

r2evans