I want to split mydata, which has 2673 observations and 23 variables, into training and test sets. However, I am not able to create the test set by simply excluding the training data.

dim(mydata)
## [1] 2673   23
set.seed(1)
train = mydata[sample(1:nrow(mydata), 1000, replace=FALSE), ]
dim(train)
## [1] 1000   23

When I run the following, I get 19 warnings and the result has 20,062 observations:

test = mydata[!train, ]
## There were 19 warnings (use warnings() to see them)
dim(test)
## [1] 20062    23

What am I doing wrong?

BroVic
PMa

1 Answer

A possible solution involves storing the sampled indices in a separate named vector.

train_idx <- sample(1:nrow(mydata), 1000, replace = FALSE)
train <- mydata[train_idx, ]  # select all these rows
test <- mydata[-train_idx, ]  # select all but these rows
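
To see the index-based split at work, and why `!train` misbehaves, here is a minimal sketch on a made-up data frame (`toy` and its columns are assumptions for illustration, not the asker's data). Applying `!` to a data frame negates every cell and returns a logical matrix, so it does not act as a "rows not in train" selector; with factor columns it also emits warnings, which is likely where the 19 warnings in the question come from.

```r
# Hypothetical toy data standing in for mydata
toy <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))

set.seed(1)
train_idx <- sample(nrow(toy), 4)   # 4 distinct row positions
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]          # negative positions drop those rows

# The two pieces partition the data: sizes add up and no id is shared
nrow(train) + nrow(test) == nrow(toy)
length(intersect(train$id, test$id)) == 0

# By contrast, `!` applied to a data frame negates every cell
neg <- !train
is.matrix(neg)   # a logical matrix, not a vector of row positions
dim(neg)         # one entry per cell of train
```

Because the matrix has one entry per cell rather than one per row of the full data, indexing rows with it returns a result of unexpected size.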

Alternatively, since a data.frame's row.names attribute must consist of unique values, you may exclude the training rows by name, e.g.

test <- mydata[!(row.names(mydata) %in% row.names(train)), ]

Note, however, that the second solution is 2x slower on mydata <- data.frame(a=1:100000, b=rep(letters, len=100000)), as measured by microbenchmark().
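
A rough way to compare the two approaches without extra packages is sketched below. It uses base R's `system.time()` instead of `microbenchmark()`, which is an assumption made here to keep the example dependency-free; a single timing is noisy, so treat the numbers as indicative only. The name `bench_data` is hypothetical.

```r
# Same shape of data as in the benchmark described above
bench_data <- data.frame(a = 1:100000,
                         b = rep(letters, length.out = 100000))

set.seed(1)
train_idx <- sample(nrow(bench_data), 1000)
train <- bench_data[train_idx, ]

# Approach 1: drop rows by negative position
t_index <- system.time(test_a <- bench_data[-train_idx, ])

# Approach 2: drop rows whose names appear in train
t_names <- system.time(
  test_b <- bench_data[!(row.names(bench_data) %in% row.names(train)), ]
)

# Both approaches should select the same rows
isTRUE(all.equal(test_a, test_b))

# Elapsed seconds for each approach (varies by machine)
t_index["elapsed"]
t_names["elapsed"]
```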

gagolews
  • Thanks! I still need to set the seed before creating `train_idx`, correct? – PMa May 11 '14 at 18:49
  • If you wish to obtain reproducible results, call `set.seed(some_number)` before `sample()`. If that's not important to you, leave the seed as-is (it is automatically set according to system time + some other info). – gagolews May 11 '14 at 18:51
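
The effect of seeding described in the comments can be checked with a short sketch. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
set.seed(42)                # any fixed number works; 42 is arbitrary
first <- sample(1:100, 5)

set.seed(42)                # reset to the same seed
second <- sample(1:100, 5)

identical(first, second)    # same seed, same draw

third <- sample(1:100, 5)   # no reseeding: continues the random stream
```

Calling `set.seed()` once, immediately before `sample()`, is enough to make `train_idx` reproducible across runs.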