How to find the optimal values for splitting the data into test and train?

Question

I am building a Python application in which I want to forecast PM2.5 values over a month. I am using polynomial regression, and I split the data into 30% test data and 70% training data with this line of code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)

But I have noticed that if I give random_state different integers, the mean squared error changes, and so does the accuracy of the forecast. How can I find the optimal parameters for train_test_split so that the forecast is as accurate as possible?

This is a data science question, not a Pandas question. Read up on "hyperparameter optimization" — Josh Friedlander, Mar 05 '20 at 14:47
@iuliana iuliana `random_state` controls which rows are randomly assigned to the train and test sets. If you need to find the best data split, loop over `random_state` in `range(0, 100)` and compare the scores across all states. — Zaraki Kenpachi, Mar 05 '20 at 14:50
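
The loop described in the comment above could look roughly like the sketch below. The data is synthetic and the helper name `mse_for_seed` is hypothetical, since the asker's actual `X` and `y` are not shown here. The sketch is probably most useful for seeing how widely the test MSE spreads across seeds: picking the seed with the lowest test MSE tunes the evaluation to one lucky split and does not make the model itself any better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the PM2.5 data (assumption; replace with the real X and y).
rng = np.random.default_rng(0)
X = np.arange(30, dtype=float).reshape(-1, 1)  # day index as the only feature
y = 20 + 0.05 * (X.ravel() - 15) ** 2 + rng.normal(0, 3, size=30)


def mse_for_seed(seed, degree=2):
    """Fit polynomial regression on one random 70/30 split and return the test MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, shuffle=True
    )
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


# Same model, same data, 100 different splits.
scores = np.array([mse_for_seed(seed) for seed in range(100)])
print(f"test MSE over 100 seeds: min={scores.min():.2f}, "
      f"mean={scores.mean():.2f}, max={scores.max():.2f}")
```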

score 0 · Answer 1 · answered Mar 05 '20 at 15:15


How much does the accuracy vary when you change the random seed?

You can use k-fold cross-validation to find the best split; however, I am not sure you want the one with the highest accuracy. You want your model to generalize. Go for a split that leaves you enough training data and a test set that is representative of the real-world data the model will encounter.
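
A minimal sketch of k-fold cross-validation with scikit-learn, assuming a degree-2 polynomial regression and synthetic data in place of the real PM2.5 values (both are assumptions, not taken from the asker's code). Every row is used for testing exactly once, so the averaged MSE depends far less on any single split. Note that `shuffle=True` treats the rows as exchangeable, which may not hold for date-ordered measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the PM2.5 data (assumption; replace with the real X and y).
rng = np.random.default_rng(0)
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 20 + 0.05 * (X.ravel() - 15) ** 2 + rng.normal(0, 3, size=30)

# Degree 2 is an arbitrary choice for illustration.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())

# 5 folds: each fold serves once as the test set, the other 4 as training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MSE is returned negated.
neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print("MSE per fold:", np.round(fold_mse, 2))
print(f"mean MSE: {fold_mse.mean():.2f} (std {fold_mse.std():.2f})")
```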

answered Mar 05 '20 at 15:15

Dipam7


Well, the mean squared error varies from 40 to 55 in my example, and I would like to get it below 40. I have also attached the code: https://github.com/iulianastroia/work/blob/master/predict_algorithms_november/polynomial_regression_MEAN_DATA_NOVEMBER.py – moro_92 Mar 05 '20 at 15:30
I did not go through the whole code, but it seems like you are predicting something based on date. In that case, you should never split randomly. If you have data from 1st January to 31st December, you can take the data up to 31st October as train and the rest as validation. Also, a useful feature engineering tip is to split the date column into things like 'day', 'is_weekend', 'is_weekday' and so on. All these attributes are already available in the datetime data type. – Dipam7 Mar 07 '20 at 15:35
I have November values from the 1st to the 30th. I want to use 70% of the data for training. Can't I use a random split for this? I need to compare the actual values to the forecasted ones. – moro_92 Mar 08 '20 at 13:12
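
The chronological split and date features suggested in Dipam7's comment could be sketched as follows for one month of daily readings. The column names `date` and `pm25`, the year, and the values are all hypothetical. The first 70% of the days are used for training and the last 30% for testing, so the forecast can still be compared with actual values, just on the days that come after the training period.

```python
import numpy as np
import pandas as pd

# Hypothetical November data: one synthetic PM2.5 reading per day.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2019-11-01", "2019-11-30", freq="D"),
    "pm25": rng.normal(25, 5, size=30),
})

# Date-derived features, taken from the pandas datetime accessor.
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5

# Chronological 70/30 split: sort by date, then cut without shuffling.
df = df.sort_values("date").reset_index(drop=True)
n_train = len(df) * 7 // 10
train, test = df.iloc[:n_train], df.iloc[n_train:]

print(f"train: {train['date'].min().date()} to {train['date'].max().date()} ({len(train)} days)")
print(f"test:  {test['date'].min().date()} to {test['date'].max().date()} ({len(test)} days)")
```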
