
We have a dataset:

the daily sales count of 100 products from January to June.

Our objective is to predict each day's sales count in July.

How should we split the dataset into a training set and a validation set?

176coding
  • Check this answer for a detailed summary : http://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio – Harjatin May 17 '16 at 16:12
  • `scikit-learn` has a useful helper function for splitting your data: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html – numentar May 17 '16 at 16:15
  • Possible duplicate of [where to save activation key](http://stackoverflow.com/questions/1360749/where-to-save-activation-key) – Prune May 17 '16 at 16:55
  • The key consideration is the fact that your data is probably time dependent. I don't think scikit-learn has a way to deal with this. But here's a link that briefly discusses this and provides a possible solution: http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/ – mark s. May 17 '16 at 17:08

1 Answer


Time series are the typical case where you should not split randomly (in general, you should not split randomly when there is significant correlation between examples).

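Sales usually aren't as strictly dynamic a time series as, say, stock prices, but using `train_test_split` could still be problematic: a random split puts days that come after some validation days into the training set.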

You can build the desired cross-validation splits yourself, without relying on sklearn's splitters (see e.g. "sklearn: User defined cross validation for time series data", "Pythonic Cross Validation on Time Series...").

Using 70-80% of the data for training is standard. Assuming the examples are spread uniformly over time, you can use the data from January through April or May for the training set and the remaining records for validation.
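As a minimal sketch of such a chronological hold-out, the snippet below splits the calendar days rather than the individual rows, so all 100 products for a given day land on the same side. The year 2016, the 0.75 fraction, and the helper name `chronological_split` are assumptions made for illustration; they are not part of the question.

```python
from datetime import date, timedelta


def chronological_split(days, train_fraction=0.75):
    """Split days into (train_days, validation_days) in time order, no shuffling."""
    ordered = sorted(set(days))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical calendar: the year 2016 is an assumption, the question gives none.
start = date(2016, 1, 1)
end = date(2016, 6, 30)
all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

train_days, val_days = chronological_split(all_days, train_fraction=0.75)

# Every training day precedes every validation day.
print("train:", train_days[0], "to", train_days[-1], len(train_days), "days")
print("validation:", val_days[0], "to", val_days[-1], len(val_days), "days")
```

Rows of the sales table would then be assigned to a set according to which list their date falls in.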

To my knowledge, at the time of writing sklearn did not support rigorous cross-validation of time-dependent problems: the out-of-the-box cross-validation routines construct training folds that include future information relative to test folds (e.g. [WIP] RollingWindow cross-validation #3638). Releases from 0.18 onward add `TimeSeriesSplit` in `sklearn.model_selection`, which keeps each training fold before its test fold.
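One library-independent way to avoid that leakage is an expanding-window (rolling-origin) scheme, where each validation block comes strictly after the days used for training. The following is only a sketch: the function name `expanding_window_folds`, the three folds, and the 30-day validation blocks are illustrative assumptions, and the indices refer to positions in the time-sorted list of days.

```python
def expanding_window_folds(n_samples, n_folds, val_size):
    """Yield (train_indices, val_indices) pairs over time-ordered samples.

    Each validation block starts right after the end of its training block,
    so no fold trains on data from the future of its validation block.
    """
    first_val_start = n_samples - n_folds * val_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        val_start = first_val_start + k * val_size
        train_idx = list(range(val_start))
        val_idx = list(range(val_start, val_start + val_size))
        yield train_idx, val_idx


# 182 days (January to June of a leap year, an assumption), three 30-day blocks.
folds = list(expanding_window_folds(n_samples=182, n_folds=3, val_size=30))
for train_idx, val_idx in folds:
    print(len(train_idx), "training days, validation days",
          val_idx[0], "to", val_idx[-1])
```

The training window grows with each fold, which mirrors how the model would be retrained as new sales data arrives before predicting July.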

Moreover, you should consider whether your data are seasonal or have another obvious division into groups (e.g. geographic regions).

manlio