I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but not exactly the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed. Is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
- Related question: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune – dzieciou Jun 12 '19 at 09:42
2 Answers
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: many algorithms in scikit-learn use random_state to select subsets of features, subsets of samples, initial weights, etc.
For example:

- Tree-based estimators use random_state for random selection of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
- In clustering estimators like KMeans, random_state is used to initialize the cluster centers.
- SVMs use it for initial probability estimation.
- Some feature selection algorithms also use it for initial selection.
- And many more...
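To see this in practice, here is a minimal sketch (assuming scikit-learn is installed, with a synthetic dataset from make_classification) that trains the same AdaBoost model several times, changing only random_state; the resulting test scores may differ slightly from seed to seed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Fixed data and split, so only the model's random_state varies.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = []
for seed in (0, 10, 42):
    clf = AdaBoostClassifier(random_state=seed).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# Scores are usually close but are not guaranteed to be identical.
print(scores)
```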
It's mentioned in the documentation that:

If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
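A small sketch of the pattern the documentation describes, using scikit-learn's check_random_state utility (the function shuffled_indices is a made-up example, not a library function): accept a random_state argument and turn it into a RandomState object, instead of touching the global numpy.random state.

```python
import numpy as np
from sklearn.utils import check_random_state

def shuffled_indices(n, random_state=None):
    # check_random_state accepts an int seed, a RandomState instance, or None,
    # and always returns a RandomState object.
    rng = check_random_state(random_state)
    return rng.permutation(n)

# With the same seed, the result is reproducible across calls.
print(shuffled_indices(5, random_state=0))
print(shuffled_indices(5, random_state=0))  # same output as above
```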
It does matter. When your training set differs, your trained state also changes: for a different subset of the data you can end up with a classifier that is a little different from one trained on some other subset.

Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
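As a sketch of this reproducibility (assuming scikit-learn; the iris dataset and random forest here are just illustrative choices): with constant seeds for both the split and the model, two runs produce identical results.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

runs = []
for _ in range(2):
    # Same random_state everywhere, so both runs are deterministic.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    runs.append(clf.score(X_te, y_te))

print(runs[0] == runs[1])  # True: identical seeds give identical results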
- so should i believe the prediction? or should i treat this random_state as another hyperparameter? it really doesn't make a lot of sense to have different prediction because of different seeds for pseudo random number generator... – kensaii Feb 27 '17 at 00:43
- It makes perfect sense, because due to random seeding you train over a different subset of data every time. It is not really a hyper-parameter; you should just set it to some fixed number, and then you will be able to reproduce your numbers across systems. – Ujjwal Feb 27 '17 at 00:45
- right... i understand. so if random_state = 0 and random_state = 10 give different results, which one should i trust? given that my dataset is sort of noisy, but not completely. – kensaii Feb 27 '17 at 01:07
- Generally, the seed which gives accuracy closest to the average accuracy should be used, for safety. If you find this useful, please accept my answer. – Ujjwal Feb 27 '17 at 01:09
- thanks, but i'd like to give other ppl chances. your answer is great, i appreciate it! – kensaii Feb 27 '17 at 01:11
- you could train with different random states and average the results of the models... probably not as good as averaging another type of model with it though. as far as which prediction to "trust"... it depends on what's at stake if you're wrong. – user1269942 Feb 27 '17 at 01:32
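The advice in the comments above can be sketched like this (assuming scikit-learn, with a synthetic dataset as a stand-in): cross-validate the model under several seeds and look at the average score and the seed-to-seed spread, rather than trusting any single seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1,
                           random_state=0)

# Mean cross-validated accuracy for each of several model seeds.
means = [cross_val_score(AdaBoostClassifier(random_state=s), X, y, cv=5).mean()
         for s in range(5)]

# Report the overall average and the variability caused by the seed alone.
print(np.mean(means), np.std(means))
```

A small standard deviation here suggests the seed barely matters for this dataset; a large one suggests the reported accuracy should be treated as an average over seeds, not a single number.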