We have a dataset:
the daily sales count of 100 products from January to June.
Our objective is to predict the daily sales count for July.
How should we split the dataset into a training set and a validation set?
Time series are the typical case where you should not split randomly (in general, you should not split randomly when there is significant correlation between examples).
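Sales usually aren't as strictly dynamic a time series as stock prices, but using train_test_split
could still be problematic: by default it shuffles the rows, so the model can be trained on days that come after the days it is validated on.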
You can obtain the desired cross-validation splits without using sklearn (e.g. sklearn: User defined cross validation for time series data, Pythonic Cross Validation on Time Series...).
70-80% for training is standard. Assuming uniform distribution of the examples, you can use data from January to April / May for the training set and the remaining records for validation.
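A minimal sketch of such a chronological split, using only the standard library. The helper name `chronological_split`, the year 2023 and the 80% fraction are assumptions chosen for illustration; the split is made on dates rather than on rows, so all 100 products for a given day end up in the same set.

```python
from datetime import date, timedelta


def chronological_split(days, train_fraction=0.8):
    """Split dates into an earlier training part and a later validation part."""
    days = sorted(set(days))
    n_train = int(len(days) * train_fraction)
    return days[:n_train], days[n_train:]


# Hypothetical calendar: every day from 1 January to 30 June 2023 (181 days).
start = date(2023, 1, 1)
all_days = [start + timedelta(days=i) for i in range(181)]

train_days, valid_days = chronological_split(all_days, train_fraction=0.8)

# Every training day precedes every validation day.
print("train:", train_days[0], "to", train_days[-1])
print("valid:", valid_days[0], "to", valid_days[-1])
```

Rows of the sales table would then be assigned to a set by looking up their date in `train_days` or `valid_days`.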
When this was written, to my knowledge, sklearn did not support rigorous cross-validation of time-dependent problems: the out-of-the-box cross-validation routines constructed training folds that include future information relative to test folds (e.g. [WIP] RollingWindow cross-validation #3638). Later releases added sklearn.model_selection.TimeSeriesSplit, whose training folds always precede the test fold.
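A forward-chaining split, in which every training fold ends before its validation fold begins, can also be written by hand. The sketch below is one possible version, not a library routine: the function name `expanding_window_splits`, the three folds and the 30-day validation blocks are all assumptions made for illustration. It works on day indices (0 = 1 January), which can be mapped back to the rows of the sales table.

```python
def expanding_window_splits(n_days, n_folds, valid_size):
    """Yield (train_idx, valid_idx) lists of day indices.

    Each validation block directly follows its training block in time,
    and the training block grows from one fold to the next.
    """
    first_valid_start = n_days - n_folds * valid_size
    if first_valid_start <= 0:
        raise ValueError("not enough days for the requested folds")
    for k in range(n_folds):
        valid_start = first_valid_start + k * valid_size
        train_idx = list(range(0, valid_start))
        valid_idx = list(range(valid_start, valid_start + valid_size))
        yield train_idx, valid_idx


# 181 days (January to June), 3 folds, 30-day validation blocks.
folds = list(expanding_window_splits(n_days=181, n_folds=3, valid_size=30))
for train_idx, valid_idx in folds:
    print(len(train_idx), "training days, validation days",
          valid_idx[0], "to", valid_idx[-1])
```

Averaging the validation error over the folds gives an estimate that never relies on future information, at the cost of training the model once per fold.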
Moreover, you should consider whether your data are seasonal or have another obvious division into groups (e.g. geographic regions).