0

I have a data set:

library(quantmod)
getSymbols('GOOG', from = "2010-05-01", to = "2017-05-01", src = "yahoo")

I am trying to split this data into three consecutive blocks by row: train (row 1 to the row at 60% of the data), test (from 60% to 80% of the data) and finally validate (from 80% to 100% of the data).

I have the following:

library(caTools)
set.seed(123)
split <- sample.split(as.numeric(GOOG$GOOG.Close), SplitRatio = 0.60)  # GOOG[Close] fails: no object named Close
train = subset(GOOG, split == TRUE)
nottrain = subset(GOOG, split == FALSE)

I am stuck here. I have been trying to split the "nottrain" data set into two parts, with little luck.

I also believe that the data set gets split randomly (correct me if I am wrong), whereas I want to split it in row order, as described above.

Any pointers in the right direction would be greatly appreciated.

user113156
  • `getSymbols` is not an R function... which package do you use? – Arthur Oct 31 '17 at 16:34
  • Have you had a look at [the task page for machine-learning](https://cran.r-project.org/web/views/MachineLearning.html)? – Arthur Oct 31 '17 at 16:36
  • Apologies! Use the quantmod package: `library(quantmod)` – user113156 Oct 31 '17 at 16:40
  • I have the answer for you, but first I need to ask a question. If you use 60% for train, then 40% would remain for test, unless you want your test and train data to overlap with each other. – Sal-laS Oct 31 '17 at 16:43
  • I am trying to achieve the following (assuming we have 100 days of data for simplicity): days 1:60 would correspond to the 60%, days 61:80 would correspond to the first 20% after the 60%, and days 81:100 would be the final 20%. (I understand the percentages are not exact... another problem to think about.) – user113156 Oct 31 '17 at 16:52
  • I guess the answers here work, just taking out the call to `sample` since you don't want randomness: https://stackoverflow.com/questions/36068963/r-how-to-split-a-data-frame-into-training-validation-and-test-sets As you can see, the proportions there don't end up being exact either... depending on #rows in the data. – Frank Oct 31 '17 at 17:00
  • @user113156 Does my answer help? – Sal-laS Nov 01 '17 at 04:28

2 Answers

-1

@user113156,

"I am trying to achieve the following (assuming we have 100 days of data for simplicity): days 1:60 would correspond to the 60%, days 61:80 would correspond to the first 20% after the 60%, and days 81:100 would be the final 20%. (I understand the percentages are not exact... another problem to think about.)"

Why don't you just put your data into a data frame, then take the first 60% of the rows and put them into a "train" df, the next 20% into a "nottrain1" df, and the last 20% into a "nottrain2" df? It seems like this would be the easiest way. Maybe I am misunderstanding the problem.
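
A minimal sketch of what that could look like. It uses a synthetic data frame (the hypothetical `prices`) as a stand-in for `GOOG` so that it runs without a download, and names the three pieces after the question; the same integer row indexing should also work on the xts object that `getSymbols` returns. Integer division is used for the cut points so they stay whole numbers when the row count is not a multiple of 5.

```r
# Synthetic stand-in for the GOOG data (assumption: rows are already in date order)
set.seed(123)
n <- 100
prices <- data.frame(
  date  = seq(as.Date("2010-05-03"), by = "day", length.out = n),
  close = 500 + cumsum(rnorm(n))
)

# Cut points via integer division: 60% and 80% of the rows, rounded down
train_end <- (3 * n) %/% 5
test_end  <- (4 * n) %/% 5

# Consecutive, non-overlapping blocks; no call to sample(), so no randomness
train    <- prices[1:train_end, ]
test     <- prices[(train_end + 1):test_end, ]
validate <- prices[(test_end + 1):n, ]

c(train = nrow(train), test = nrow(test), validate = nrow(validate))
```

With a row count that does not divide evenly, the leftover rows end up in the last block, so the percentages are only approximate.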

Ironman454
    The usual advice is to (somehow) gain enough points to comment directly instead of posting comments as answers. If you consider this an answer already, maybe you could make it clearer by posting the code for what you describe? – Frank Oct 31 '17 at 17:00
-2

Can you please clarify your question? When splitting your data, are you trying to do the following: split the dataset into the first 60% of the records for train and the remaining 40% for nottrain, and then split nottrain in half? For example, if you have 1000 records, do you want records 1-600 in train, records 601-800 in the first part of nottrain, and records 801-1000 in the second part of nottrain, or do you want it all randomized? If you can clarify, we can help.
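
For the ordered reading of the question, the 1000-record example can be sketched by labelling row positions rather than slicing. This is only an illustration: the labels and the variables `part` and `idx` are made-up names, and the break points are the ones from the example above.

```r
n <- 1000

# Label each row position by the block it falls in:
# (0, 600] -> train, (600, 800] -> nottrain1, (800, 1000] -> nottrain2
part <- cut(seq_len(n),
            breaks = c(0, 600, 800, n),
            labels = c("train", "nottrain1", "nottrain2"))

# Row indices for each block, as a named list
idx <- split(seq_len(n), part)

table(part)           # number of rows per block
range(idx$nottrain1)  # first and last row of the middle block
```

Each element of `idx` can then be used to index the rows of the data, for example `GOOG[idx$train, ]`.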

Ironman454