How to split a dataset into a training set and a validation set

Question

We have a dataset:

the daily sales count of 100 products from January to June.

Our objective is to predict the sales count for each day in July.

How should we split the dataset into a training set and a validation set?

Check this answer for a detailed summary : http://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio — Harjatin, May 17 '16 at 16:12
`scikit-learn` has a useful helper function for splitting your data: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html — numentar, May 17 '16 at 16:15
Possible duplicate of [where to save activation key](http://stackoverflow.com/questions/1360749/where-to-save-activation-key) — Prune, May 17 '16 at 16:55
The key consideration is the fact that your data is probably time dependent. I don't think scikit-learn has a way to deal with this. But here's a link that briefly discusses this and provides a possible solution: http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/ — mark s., May 17 '16 at 17:08

score 3 · Answer 1 · edited May 23 '17 at 11:59

Time series are the typical case where you should not split randomly (in general, you should not split randomly when there is significant correlation between examples).

Sales usually aren't a strictly dynamic time series (like stock prices), but using train_test_split could still be problematic, because it shuffles the rows by default and so mixes later days into the training set.

You can obtain the desired cross-validation splits without using sklearn (e.g. sklearn: User defined cross validation for time series data, Pythonic Cross Validation on Time Series...).

70-80% for training is standard. Assuming the examples are uniformly distributed over time, you can use the data from January to April / May for the training set and the remaining records for validation.
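A minimal sketch of such a date-based split, using only the standard library. The year (2016), the May 1 cutoff, the record layout and the zero sales values are all assumptions made up for illustration; they are not from the question.

```python
from datetime import date, timedelta

# Hypothetical toy data: one (day, product_id, units_sold) record per product
# per day, January through June. The year and the sales values are placeholders.
start, end = date(2016, 1, 1), date(2016, 6, 30)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
records = [(day, product_id, 0) for day in days for product_id in range(100)]

# Split on a calendar cutoff instead of at random: every record dated before
# the cutoff goes to training, the cutoff day and everything later to validation.
cutoff = date(2016, 5, 1)  # assumed cutoff: train on Jan-Apr, validate on May-Jun
train = [r for r in records if r[0] < cutoff]
valid = [r for r in records if r[0] >= cutoff]

# Fraction of the records that ended up in the training set.
train_share = len(train) / len(records)
print(f"train: {len(train)} rows, valid: {len(valid)} rows, share: {train_share:.2f}")
```

With this cutoff about 66% of the days land in the training set; moving the cutoff to June 1 would give about 84%, so a cutoff somewhere in May falls inside the 70-80% range.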

Currently, to my knowledge, sklearn does not support rigorous cross-validation of time-dependent problems. All out-of-the-box cross-validation routines will construct training folds that include future information relative to test folds (e.g. [WIP] RollingWindow cross-validation #3638).
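One way to build folds that never look into the future is an expanding window: each validation fold is a block of consecutive periods, and its training fold is everything before that block. The sketch below is a hand-rolled illustration, not the code from the linked pull request; the function name `expanding_window_splits` and its parameters are hypothetical. scikit-learn releases from 0.18 onward also provide `TimeSeriesSplit` in `sklearn.model_selection`, which follows the same idea.

```python
def expanding_window_splits(n_periods, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs over time-ordered period indices.

    Each training fold consists of every period before its test fold,
    so a training fold never contains information from the future.
    (Hypothetical helper for illustration; not part of any library.)
    """
    test_size = (n_periods - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough periods for the requested folds")
    for k in range(n_folds):
        test_start = min_train + k * test_size
        test_end = test_start + test_size
        yield list(range(test_start)), list(range(test_start, test_end))


# Example: 26 weeks (roughly January to June), 4 folds,
# with at least 10 weeks of history in the first training fold.
folds = list(expanding_window_splits(n_periods=26, n_folds=4, min_train=10))
for train_idx, test_idx in folds:
    print(f"train weeks 0-{train_idx[-1]}, validate weeks {test_idx[0]}-{test_idx[-1]}")
```

The indices refer to time-ordered periods (weeks here), so all 100 products for a given week stay on the same side of each split.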

Moreover, you should consider whether your data are seasonal or have another obvious division into groups (e.g. geographic regions).
