How to divide a dataset in R randomly

Question

I have a dataset covering around 50 contiguous days. I want to divide it into training and test sets such that, in each week, 5 of the days are in the training set and 2 of the days are in the test set.

The 2 test days should be selected randomly, i.e. not always the same ones (e.g. not always the first 2 days).

How could I do that?

Is there a function for this in R? Currently I split the data as shown below, but the training and test observations probably end up very close to each other in time, so the resulting MSR value is always very high.

set.seed(100)

train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
TrainSet <- dataset1[train,]
#scale (TrainSet, center = TRUE, scale = TRUE)
ValidSet <- dataset1[-train,]
#scale (ValidSet, center = TRUE, scale = TRUE)
summary(TrainSet)
summary(ValidSet)

Example Data:

data
#            timestamp var1  var2  var3 var5
#1 2018-07-20 13:40:00   12  0.00 30.12   10
#2 2018-07-20 13:45:00   12  0.10 10.15   10
#3 2018-07-20 13:50:00    2 11.00 19.22   17
#4 2018-07-20 13:55:00   22  3.05 23.31    3

dput(data)
structure(list(timestamp = c("2018-07-20 13:50:00", "2018-07-20 13:52:00", 
"2018-07-20 13:54:00", "2018-07-20 13:56:00"), var1 = c(12, 12, 
2, 22), var2 = c(0, 0.1, 11, 3.05), var3 = c(30.12, 10.15, 19.22, 
23.31), var5 = c(10L, 10L, 17L, 3L)), class = "data.frame", row.names = c(NA, 
-4L))

Please post some example data. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — alan ocallaghan, Nov 12 '19 at 17:52
In this case, the columns var1 to var5 don't really matter, but the question is about randomly selected 2 days per week, dates spanning at least 2 weeks are probably necessary to illustrate the problem. If there is a possibility of incomplete weeks, please specify what you want to happen in that case and also include an incomplete week in the sample data. (Will you always have at least one observation per day? What if your most recent week only has 3 days of data?) — Gregor Thomas, Nov 13 '19 at 13:53
Feel free to split your data the way you want, but just take into account that it is very likely in this situation that you will have significant correlation between your training dataset and your holdout. It's generally a better practice to split on a year basis than on a day basis to avoid overoptimistic results on your holdout. — J.P. Le Cavalier, Nov 13 '19 at 13:53
@Gregor: Thank you for the reply. There are observations every 5 minutes. Last week is an incomplete week, it has 6 days. They are from 20th of July 2019 to 22nd of Aug 2019. — XCeptable, Nov 13 '19 at 15:55

score 1 · Answer 1 · answered Nov 12 '19 at 18:00

This is an example of how to partition data such as this:

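set.seed(42)
days_of_the_week <- letters[1:7]

# Toy data: 105 values, with the 7 day labels recycled across the rows
df <- data.frame(day = days_of_the_week, value = rnorm(105))

# Randomly hold out 2 of the 7 days for testing; the other 5 are for training
test_days <- sample(unique(df$day), 2)
train_days <- setdiff(unique(df$day), test_days)

test_data <- df[df$day %in% test_days, ]
train_data <- df[df$day %in% train_days, ]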
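
The snippet above draws one set of days for the whole dataset. If the 2 test days should be drawn separately within each week, one possible variation is sketched below. The `week` and `obs` columns and the 3-week toy layout are assumptions made for illustration only.

```r
set.seed(42)

# Toy data (hypothetical layout): 3 weeks x 7 days x 5 observations per day
df <- expand.grid(obs = 1:5, day = letters[1:7], week = 1:3,
                  stringsAsFactors = FALSE)
df$value <- rnorm(nrow(df))

# For each week, draw 2 of that week's days and build "week day" keys
test_keys <- unlist(lapply(split(df, df$week), function(w) {
  paste(w$week[1], sample(unique(w$day), 2))
}))

# A row goes to the test set if its own week/day pair was drawn
is_test <- paste(df$week, df$day) %in% test_keys

test_data  <- df[is_test, ]
train_data <- df[!is_test, ]
```

Sampling inside `split(df, df$week)` means each week can get a different pair of test days, while whole days still stay together on one side of the split.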

answered Nov 12 '19 at 18:00

alan ocallaghan


Thank you for the help. As you can see from my sample data, my data is in a different form from the data in your answer. Your data frame has a day column, while my data has one record per observation, taken every 5 minutes, consecutively for several weeks. How do I apply the method from your answer to that? – XCeptable Nov 13 '19 at 21:00
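
One way this could carry over to timestamped records is to derive a weekday label from each timestamp first and then split on that label. The sketch below is not the answerer's code: it uses simulated 5-minute data as a stand-in, and the start date, the UTC time zone, and the `day` column are assumptions. `TrainSet`, `ValidSet`, and `dataset1` are the names from the question.

```r
set.seed(100)

# Simulated stand-in for the real data: one row every 5 minutes for 34 days
# (start date and time zone are assumptions, not taken from the real file)
timestamps <- seq(as.POSIXct("2019-07-20 00:00:00", tz = "UTC"),
                  by = "5 min", length.out = 34 * 288)
dataset1 <- data.frame(timestamp = format(timestamps, "%Y-%m-%d %H:%M:%S"),
                       var1 = rnorm(length(timestamps)),
                       stringsAsFactors = FALSE)

# Parse the character timestamps and label each row with its ISO weekday
# ("1" = Monday, ..., "7" = Sunday)
dataset1$day <- format(as.POSIXct(dataset1$timestamp, tz = "UTC"), "%u")

# Hold out 2 randomly chosen weekdays; all records from those days go to test
test_days <- sample(unique(dataset1$day), 2)
ValidSet <- dataset1[dataset1$day %in% test_days, ]
TrainSet <- dataset1[!dataset1$day %in% test_days, ]
```

Because the split is made on the derived `day` label rather than on individual rows, all 5-minute records from one day land on the same side. To vary the held-out days from week to week, a week index such as `as.integer(date - min(date)) %/% 7` (with `date` obtained via `as.Date`) could be added and the sampling done within each week.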
