Trying to split data into train, test and validation sets (in chronological order)

Question

I have a data set:

library(quantmod)
getSymbols('GOOG', from = "2010-05-01", to = "2017-05-01", src = "yahoo")

I am trying to split this data chronologically into train (row 1 to the row at 60% of the data), test (the row at 60% to the row at 80% of the data) and finally validate (the row at 80% to the last row of the data).

I have the following:

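library(caTools)
set.seed(123)
# getSymbols() names the closing-price column GOOG.Close
split <- sample.split(GOOG$GOOG.Close, SplitRatio = 0.60)
# rows flagged TRUE go to train, the rest to nottrain
train = subset(GOOG, split == TRUE)
nottrain = subset(GOOG, split == FALSE)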

I am stuck here. I have been trying to split the "nottrain" data set into two parts, with little luck.

I also believe that the data set gets split randomly (correct me if I am wrong). I am trying to split it as described above.

Any pointers in the right direction would be greatly appreciated.

`getSymbols` is not an R function... which package do you use? — Arthur, Oct 31 '17 at 16:34
Have you had a look at [the task page for machine-learning](https://cran.r-project.org/web/views/MachineLearning.html)? — Arthur, Oct 31 '17 at 16:36
I have the answer for you, but first I need to ask a question. If you use 60% for train, then 40% remains for test, unless you want your test and train data to overlap with each other — Sal-laS, Oct 31 '17 at 16:43
I am trying to achieve (assuming we have 100 days of data for simplicity): day 1:60 would correspond to the 60%, day 61:80 would correspond to the first 20% after the 60%, day 81:100 would be the final 20%, (I understand the % are not accurate...another problem to think about) — user113156, Oct 31 '17 at 16:52
I guess the answers here work, just taking out the call to `sample` since you don't want randomness: https://stackoverflow.com/questions/36068963/r-how-to-split-a-data-frame-into-training-validation-and-test-sets As you can see, the proportions there don't end up being exact either... depending on #rows in the data. — Frank, Oct 31 '17 at 17:00

score -1 · Answer 1 · answered Oct 31 '17 at 16:58

@user113156,

"I am trying to achieve (assuming we have 100 days of data for simplicity): day 1:60 would correspond to the 60%, day 61:80 would correspond to the first 20% after the 60%, day 81:100 would be the final 20%, (I understand the % are not accurate...another problem to think about)"

Why don't you just put your data into a data frame, then take the first 60% of the rows and put them into a "train" df, the next 20% into a "nottrain1" df, and the last 20% into a "nottrain2" df? It seems like this would be the easiest way. Maybe I am misunderstanding the problem.
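
A minimal sketch of that idea, assuming the rows are already sorted by date. To keep it runnable without a download, it uses a made-up 100-row data frame called `prices` (a hypothetical stand-in for the GOOG data); the names `train_end` and `test_end` are also just illustrative. The same row indexing should apply to an xts object such as GOOG.

```r
# Hypothetical stand-in for the price data: 100 daily rows in date order.
set.seed(123)
prices <- data.frame(
  date  = seq(as.Date("2017-01-01"), by = "day", length.out = 100),
  close = cumsum(rnorm(100))
)

n <- nrow(prices)
train_end <- floor(0.6 * n)  # last row of the first 60%
test_end  <- floor(0.8 * n)  # last row of the next 20%

# Slice by row position, so no row is sampled and the order is preserved.
train    <- prices[1:train_end, ]
test     <- prices[(train_end + 1):test_end, ]
validate <- prices[(test_end + 1):n, ]
```

When the number of rows is not a multiple of 5, `floor()` pushes the leftover rows into the last piece, so the proportions are only approximately 60/20/20.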

The usual advice is to (somehow) gain enough points to comment directly instead of posting comments as answers. If you consider this an answer already, maybe you could make it clearer by posting the code for what you describe? — Frank, Oct 31 '17 at 17:00

score -2 · Answer 2 · answered Oct 31 '17 at 16:54


Can you please clarify your question? When splitting your data, are you trying to do the following: split the dataset into the first 60% of the records for train and the remaining 40% for nottrain, and then split nottrain in half? For example, if you have 1000 records, do you want records 1-600 in train, records 601-800 in the first part of nottrain and 801-1000 in the second part of nottrain, or do you want it all randomized? If you can clarify, we can help.
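
For the non-randomized reading of the 1000-record example, one possible sketch labels each row position with `cut()` and then uses `split()`. The data frame `records` and the labels `nottrain1` and `nottrain2` are hypothetical, chosen only to mirror the example above.

```r
# Hypothetical 1000-record data set, already in the desired order.
n <- 1000
records <- data.frame(id = seq_len(n))

# Label each row position: (0, 600] train, (600, 800] nottrain1, (800, 1000] nottrain2.
part <- cut(
  seq_len(n),
  breaks = c(0, 0.6 * n, 0.8 * n, n),
  labels = c("train", "nottrain1", "nottrain2")
)

# split() returns a named list with one data frame per label.
parts <- split(records, part)
sapply(parts, nrow)  # 600, 200 and 200 rows
```

Because the labels come from row position rather than from `sample()`, the three pieces are contiguous and keep the original ordering.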

answered by Ironman454

Yes, I am trying to achieve this. I have perhaps tried to overcomplicate the situation though... – user113156 Oct 31 '17 at 17:01
You should post this as a comment. – Joao Vitorino Oct 31 '17 at 17:02
