Problems with the random-state parameter on data splitting with sklearn

Question

When I look up the random_state parameter in sklearn's documentation, this is what I find:

random_state : int or RandomState Pseudo-random number generator state used for random sampling.

I don't really understand what it does.

The accuracy of different classifiers changes noticeably depending on the number I pass as the random_state parameter. Why is that? Which number should I set?

It is my first time on a Machine Learning project.

Possible duplicate of [Random state (Pseudo-random number)in Scikit learn](http://stackoverflow.com/questions/28064634/random-state-pseudo-random-numberin-scikit-learn) — Vivek Kumar, Mar 24 '17 at 14:51

score 3 · Answer 1 · edited May 23 '17 at 10:30

Setting the random_state parameter ensures that your data are split in exactly the same manner each time you run your code. This practice is important when you want to compare the accuracy of different models (e.g. different algorithms or additional features, or both): if you keep shuffling the deck in different ways while testing new approaches, how are you to know whether the increase or decrease in accuracy is due to the changes you've made to your model, versus being due to using slightly different train and test datasets?
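To make the reproducibility point concrete, here is a minimal sketch using `train_test_split`. The toy arrays, the seed values, and the variable names are made up for illustration and are not part of the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two splits with the same seed select exactly the same test rows.
_, _, _, y_test_a = train_test_split(X, y, test_size=0.25, random_state=100)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.25, random_state=100)
same_seed_matches = np.array_equal(y_test_a, y_test_b)

# A different seed shuffles the data differently, so the test rows
# will usually (though not necessarily) differ.
_, _, _, y_test_c = train_test_split(X, y, test_size=0.25, random_state=7)

print("same seed, same test rows:", same_seed_matches)
print("test rows with seed 100:", sorted(y_test_a))
print("test rows with seed 7:  ", sorted(y_test_c))
```

If you leave `random_state` unset, each run of the script draws a fresh shuffle, which is what makes accuracies hard to compare between runs.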

As for which number to choose for your random_state parameter: that's up to you. Some people experiment with different values and see which random_state gives the best-performing model. It really depends on your application: is this a production-scale machine-learning model you're developing, or is it a model for a data science challenge? In the former case, it shouldn't matter much. In the latter case, I have known people who tune their model completely and then begin experimenting with different random_state values to bump up their accuracies. I don't necessarily agree with that practice, because it seems like another form of overfitting (see more here). I usually choose 100 because that number is funny to me; there's really no logic behind it. Some people choose 42, others 1, etc.
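One way to see why picking the "best" seed is suspect is to score the same model on several splits and look at the spread. The sketch below is only an illustration: the iris dataset, the k-nearest-neighbours classifier, the seed list, and the helper `accuracy_for_seed` are all assumptions chosen for the example, not something from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)


def accuracy_for_seed(seed):
    """Split with the given seed, fit a fixed model, return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Arbitrary seeds; none of them is "correct".
seeds = [1, 7, 42, 100, 2017]
scores = np.array([accuracy_for_seed(s) for s in seeds])

for seed, score in zip(seeds, scores):
    print(f"random_state={seed}: accuracy={score:.3f}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The model is identical in every run, so any difference between the scores comes only from which rows ended up in the test set. Reporting the mean and spread over a few seeds is more honest than reporting the single highest number.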

See a more detailed example here.

I have already read the documentation you pointed me to. However, I'm afraid I'm still not clear on which random_state number I should set. Thank you for the quick reply! — Borja Fernández Antelo, Mar 24 '17 at 14:05
@BorjaFernándezAntelo Did you even read what I wrote? I detail in the second paragraph how you go about "choosing" your `random_state` parameter. — blacksite, Apr 11 '17 at 00:39
