Setting the random_state parameter ensures that your data are split in exactly the same way each time you run your code. This matters when you want to compare the accuracy of different models (e.g. different algorithms, additional features, or both): if you keep shuffling the deck differently while testing new approaches, how can you tell whether a change in accuracy comes from the changes you made to your model or from using slightly different train and test datasets?
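As a minimal sketch of the point above, assuming the split is done with scikit-learn's train_test_split (the toy arrays and variable names are made up for illustration), two calls with the same random_state return the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same random_state.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=100
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# Both calls select the same rows for the test set, so this is True.
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("identical split:", same_split)
```

Leaving random_state unset (None) would draw a fresh shuffle on each call, so any accuracy comparison between runs would mix model changes with split changes.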
As for choosing the number for your random_state parameter: that's up to you. Some people experiment with different values and see which random_state value makes the model perform best. It really depends on your application: is this a production-scale machine-learning model you're developing, or is it a model for a data science challenge? In the former case, it shouldn't matter much. In the latter case, I have known people who tune their model completely and then begin experimenting with different random_state values to bump up their accuracies. I don't necessarily agree with that practice, because it seems like another form of overfitting (see more here). I usually choose 100 because that number is funny to me; there's really no logic behind it. Some people choose 42, others 1, etc.
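To see why picking the "best" seed looks like overfitting, here is a rough sketch, again assuming scikit-learn; the synthetic dataset, the LogisticRegression model, and the seed values are arbitrary choices for illustration. The model and data stay fixed and only the split changes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic dataset, generated once so that only the split varies below.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = {}
for seed in (1, 42, 100):
    # Same data and same model settings; only the train/test partition differs.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores[seed] = model.score(X_test, y_test)

for seed, acc in scores.items():
    print(f"random_state={seed}: test accuracy={acc:.3f}")
```

Any difference between these scores comes from which rows landed in the test set, not from a better model, which is why reporting only the highest one can be misleading.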
See a more detailed example here.