Create Training and Test Dataset in R

Question

I want to create training and test data from mydata, which has 2673 observations and 23 variables. However, I am not able to create the test set by simply subtracting the training data.

dim(mydata)
## [1] 2673   23
set.seed(1)
train = mydata[sample(1:nrow(mydata), 1000, replace=FALSE), ]
dim(train)
## [1] 1000   23

When I run the following, I get 19 warnings and the result has 20,062 observations:

test = mydata[!train, ]
## There were 19 warnings (use warnings() to see them)
dim(test)
## [1] 20062    23

What am I doing wrong?

Relevant here: http://stackoverflow.com/q/5963269/54964 – Léo Léopold Hertz 준영, May 17 '17 at 10:49

gagolews · Accepted Answer · 2014-05-11T19:01:39.817

A possible solution involves storing the sampled indices in a separate named vector.

train_idx <- sample(1:nrow(mydata),1000,replace=FALSE)
train <- mydata[train_idx,] # select all these rows
test <- mydata[-train_idx,] # select all but these rows
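As a quick sanity check, the index-based split can be tried on a small stand-in data set. This is only a sketch: the built-in iris data (150 rows) and the training size of 100 are assumptions used for illustration, not part of the original question.

```r
# Stand-in for mydata (assumption): the built-in iris data set, 150 rows
mydata <- iris

set.seed(1)                                    # only needed for reproducibility
train_idx <- sample(seq_len(nrow(mydata)), 100, replace = FALSE)

train <- mydata[train_idx, ]                   # rows that were sampled
test  <- mydata[-train_idx, ]                  # all remaining rows

# The two sets should partition the rows of mydata without overlap
stopifnot(nrow(train) + nrow(test) == nrow(mydata))
stopifnot(length(intersect(row.names(train), row.names(test))) == 0)
```

Negative indices drop the listed rows, so every row ends up in exactly one of the two sets.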

Alternatively, since a data.frame's row.names attribute must consist of unique values, you can also select the rows whose names do not appear in train, e.g.

test <- mydata[!(row.names(mydata) %in% row.names(train)), ]

However, the second solution is about 2x slower on mydata <- data.frame(a=1:100000, b=rep(letters, len=100000)), as measured by microbenchmark().
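A rough way to compare the two approaches yourself is sketched below. It assumes the same example data frame and a training size of 1000; the microbenchmark package is optional here and is only used if it happens to be installed. Timings will vary by machine, so no particular ratio is guaranteed.

```r
# Example data from the answer; the training size of 1000 is an assumption
mydata <- data.frame(a = 1:100000, b = rep(letters, length.out = 100000))

set.seed(1)
train_idx <- sample(seq_len(nrow(mydata)), 1000, replace = FALSE)
train <- mydata[train_idx, ]

# Both approaches should select the same rows, in the same order
test_neg <- mydata[-train_idx, ]
test_rn  <- mydata[!(row.names(mydata) %in% row.names(train)), ]
stopifnot(identical(row.names(test_neg), row.names(test_rn)))

# Optional timing comparison, run only if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    negative_index = mydata[-train_idx, ],
    row_names      = mydata[!(row.names(mydata) %in% row.names(train)), ],
    times = 10
  ))
}
```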

edited May 11 '14 at 19:01

answered May 11 '14 at 18:46

gagolews

Thanks! I still need to set the seed before creating `train_idx`, correct? – PMa May 11 '14 at 18:49
If you wish to obtain reproducible results, call `set.seed(some_number)` before `sample()`. If that's not important to you, leave the seed as-is (it is automatically set according to system time + some other info). – gagolews May 11 '14 at 18:51
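The effect of the seed can be illustrated with a minimal sketch. The population size of 2673 comes from the question; the seed value 1 is arbitrary.

```r
n <- 2673                                      # number of rows in the question's mydata

set.seed(1)
idx_a <- sample(seq_len(n), 1000, replace = FALSE)

set.seed(1)                                    # same seed, so the same draw is repeated
idx_b <- sample(seq_len(n), 1000, replace = FALSE)

stopifnot(identical(idx_a, idx_b))

# Without resetting the seed, the next draw continues the random stream
# and will generally give a different set of indices
idx_c <- sample(seq_len(n), 1000, replace = FALSE)
```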
