I want to split my training data into 70% training, 15% testing, and 15% validation sets. I am using the createDataPartition() function from the caret package, and I am currently splitting it as follows:

library(caret)

train <- read.csv("Train.csv")
test <- read.csv("Test.csv")

split <- 0.70
# createDataPartition() returns the row indices of a stratified 70% sample
trainIndex <- createDataPartition(train$age, p = split, list = FALSE)
data_train <- train[trainIndex, ]
data_test <- train[-trainIndex, ]

Is there any way to split into training, testing, and validation sets using createDataPartition(), like the following H2O approach?

data.hex <- h2o.importFile("Train.csv")
splits <- h2o.splitFrame(data.hex, c(0.7,0.15), destination_frames = c("train","valid","test"))
train.hex <- splits[[1]]
valid.hex <- splits[[2]]
test.hex  <- splits[[3]]
Miguel Rayon Gonzalez
Mahsolid

2 Answers

9

A method using the sample() function in base R is

splitSample <- sample(1:3, size=nrow(data.hex), prob=c(0.7,0.15,0.15), replace = TRUE)
train.hex <- data.hex[splitSample==1,]
valid.hex <- data.hex[splitSample==2,]
test.hex <- data.hex[splitSample==3,]
Erin LeDell
lmo
  • With this I get `nrow(data.hex)` = 25192, `nrow(train.hex)` = 8398, `nrow(valid.hex)` = 8397 and `nrow(test.hex)` = 8397, so the three sizes differ by only 1. Is this correct? – Mahsolid Apr 07 '16 at 17:39
  • Oops. Forgot the size argument. – lmo Apr 07 '16 at 17:46
  • Note that this is (quasi) random, so the proportions of the three groups will be approximately 0.7, 0.15, and 0.15, but not exactly. To make the split reproducible, set the seed above the first line: `set.seed(some integer)` – lmo Apr 07 '16 at 17:53
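If exact group sizes matter, one alternative to the approach in this answer is to shuffle the row indices once and cut the shuffled vector at fixed positions. The sketch below is only an illustration: the data frame `df`, the seed value, and the names `train_df`, `valid_df`, and `test_df` are hypothetical stand-ins, not part of the original question.

```r
set.seed(123)  # arbitrary seed, only so the split can be reproduced

# Hypothetical stand-in for the real data
df <- data.frame(id = 1:1000, x = rnorm(1000))

n   <- nrow(df)
idx <- sample(n)  # random permutation of the row indices 1..n

n_train <- round(0.70 * n)
n_valid <- round(0.15 * n)

# Cut the shuffled indices into three consecutive, non-overlapping blocks
train_df <- df[idx[1:n_train], ]
valid_df <- df[idx[(n_train + 1):(n_train + n_valid)], ]
test_df  <- df[idx[(n_train + n_valid + 1):n], ]
```

Unlike drawing a group label per row with `sample(1:3, ...)`, this fixes the training and validation sizes up front, and the test set receives whatever rows remain.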
0

Take a look at "train, validation, test split model in CARET in R". The idea is to call createDataPartition() twice: first with p=0.7 to get 70% training data and 30% remaining data, then with p=0.5 on the remaining data to get 15% testing and 15% validation.
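The two-stage idea can be sketched roughly as follows. The synthetic `train` data frame stands in for `read.csv("Train.csv")`, and the seed and the names `holdout`, `data_valid`, and `validIndex` are assumptions made for illustration.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for read.csv("Train.csv")
train <- data.frame(age    = sample(18:80, 1000, replace = TRUE),
                    income = rnorm(1000))

# Stage 1: about 70% training, 30% held out
trainIndex <- createDataPartition(train$age, p = 0.70, list = FALSE)
data_train <- train[trainIndex, ]
holdout    <- train[-trainIndex, ]

# Stage 2: split the held-out rows in half, about 15% validation and 15% test
validIndex <- createDataPartition(holdout$age, p = 0.50, list = FALSE)
data_valid <- holdout[validIndex, ]
data_test  <- holdout[-validIndex, ]

# Row counts per subset
sapply(list(train = data_train, valid = data_valid, test = data_test), nrow)
```

Because createDataPartition() samples within groups of the outcome, the resulting sizes are close to 70/15/15 but may be off by a few rows.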

Perceptron