
I have a data set called `data`, which I want to split into two new data sets, `test` and `train`.

I want the splitting to be random, without replacement.

Using the code below, I get `train` as a new data frame with 35 rows:

rows_in_train <- 35  # number of rows to randomly select for train
rows_in_test <- nrow(data) - rows_in_train
train <- data[sample(nrow(data), rows_in_train), ]

Is there a nice way in R to assign the complement of `train` to a new data set called `test`? I assume there must be a function for this.

Frank
tumultous_rooster

1 Answer

myData <- data.frame(a = 1:20, b = 101:120)
set.seed(123)  # set the seed first so the random split can be reproduced
trainRows <- runif(nrow(myData)) > 0.25  # each row has a ~25% chance of being set aside for test
train <- myData[trainRows, ]  # 16 rows with this seed
test <- myData[!trainRows, ]  # 4 rows with this seed; the split is only approximately 75/25

# Alternative: select a fixed number of rows, in this case 5 rows for test
testRows2 <- sort(sample(nrow(myData), 5, replace = FALSE))

train2 <- myData[-testRows2, ]  # every row except the sampled ones
test2 <- myData[testRows2, ]
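
Applied to the setup in the question, the fixed-size approach might look like the sketch below. It assumes `data` is a data frame with more than 35 rows; the built-in `iris` data set (150 rows) stands in for it here, and the name `train_idx` is only illustrative. The key point is to store the sampled row indices so the complement can be taken from them.

```r
# Stand-in for the asker's `data`; iris (150 rows) is an assumption for illustration
data <- iris
rows_in_train <- 35

set.seed(123)  # optional, only needed for a reproducible split
train_idx <- sample(nrow(data), rows_in_train)  # sample() draws without replacement by default

train <- data[train_idx, ]
test <- data[-train_idx, ]  # negative indices drop the train rows, leaving the complement

# Equivalent way to get the complement, as explicit row indices
test_idx <- setdiff(seq_len(nrow(data)), train_idx)
```

One caveat: if `train_idx` could ever be empty, `data[-train_idx, ]` returns zero rows rather than all of them, so indexing with `test_idx` from `setdiff()` is the safer variant in that case.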
  • Seems to be a problem with consistency...I get 4 obs for test and 16 for train... – tumultous_rooster Feb 07 '14 at 23:29
  • I added the line about setting the seed after making the post. I too get 16 rows for train. As long as you set the seed before you make the call to `runif`, you should get consistent results. –  Feb 07 '14 at 23:39
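
The point about seeding can be checked with a short sketch, reusing `myData` from the answer (the names `first` and `second` are only illustrative). Resetting the seed immediately before each call to `runif` should give the same split both times.

```r
myData <- data.frame(a = 1:20, b = 101:120)

set.seed(123)
first <- runif(nrow(myData)) > 0.25

set.seed(123)  # reset the seed right before drawing again
second <- runif(nrow(myData)) > 0.25

identical(first, second)  # TRUE: same seed, same draws, same split
sum(first)                # number of rows that end up in train
```

Without the second `set.seed` call, the two logical vectors would generally differ, which would explain row counts changing from run to run.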