
Following from the discussion here: confused about random_state in decision tree of scikit learn

Since I am setting my random_state to 1, I get consistent metrics because I am generating the same tree every time. But when random_state is left at its default of None, the different trees generated on each run have different performance metrics; some are better than others. How do we then get the best possible metric, or in other words, how do I find out what int value to set for random_state so that I get the tree with the best accuracy and kappa statistics?
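For reference, a minimal sketch of the behaviour described above (the dataset and the fixed train/test split are illustrative assumptions, not part of the original question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# The split itself is fixed so that only the tree's randomness varies.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# random_state=1: the same tree (and the same accuracy) on every run.
fixed = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(fixed.score(X_test, y_test))

# random_state=None (the default): tie-breaking between equally good
# splits is random, so repeated runs can yield different trees and metrics.
loose = DecisionTreeClassifier(random_state=None).fit(X_train, y_train)
print(loose.score(X_test, y_test))
```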

Rohan

2 Answers


You should not optimize the value of random_state. In general, you don't want to fix it at all, except when you want someone else to arrive at exactly the same numbers as you (e.g. to reproduce figures).

Let's give an example that might highlight why you should not do that. Run an experiment where you perform a K-fold cross-validation. Each split will lead to a different model (a different tree, in your example).

If I select the best model found during this cross-validation, my conclusions will be over-optimistic. I should instead look at the mean performance and its fluctuation. These variations tell me what impact giving different data to my model has. They also let me quickly check whether the difference in performance between two models is significant: e.g. two models whose mean performances differ by 0.01 with a std. dev. of 0.1 should not lead you to conclude that one model is better than the other.
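A minimal sketch of that procedure (the dataset is an illustrative assumption; any classification data would do):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: five trees are built, giving five scores.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)

# Report mean +/- std. dev. instead of cherry-picking the best fold.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```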

There are additional answers on Stack Exchange regarding this topic: https://stats.stackexchange.com/a/264008/121348

glemaitre
  • Since I want to compare the accuracy of my decision trees after performing different experiments, won't different random_state values for different experiments make my metrics incomparable to each other? – Rohan Dec 20 '19 at 17:29
  • No. For each experiment, you will get a mean accuracy computed by cross-validation (and some std. dev.); you can compare those, or run statistical tests on them, to draw your conclusions (see the sketch below). – glemaitre Dec 20 '19 at 22:44
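To make that comment concrete, a minimal sketch comparing two experiments by their cross-validated means (the dataset and the two max_depth settings are illustrative assumptions, not from the thread):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Two "experiments": two tree configurations, each summarized by its
# cross-validated mean +/- std. dev., which makes them comparable.
for name, model in [
    ("shallow tree", DecisionTreeClassifier(max_depth=3)),
    ("deep tree", DecisionTreeClassifier(max_depth=None)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```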

The random_state parameter adds a degree of randomness to model building and, as you have correctly understood, the different resulting models will produce different performance metrics and accuracies.

To find the best possible parameters for the model and optimize its accuracy, you can make use of GridSearchCV. It performs cross-validation over a parameter grid (a range of possible parameter values and their combinations) to find the combination that gives the best results.
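A minimal sketch of such a search, assuming a decision tree and an illustrative grid (the parameter values shown are placeholders, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Tune the parameters that control the tree itself, not random_state.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated score of the best combination
```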

The above method is computationally intensive, as it generates, trains, and tests many models, but this way you can find the best possible parameter values (without trial-and-error over random_state) and optimize the accuracy of the model.

skillsmuggler