Can someone explain what `random_state` means in the example below?

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(10).reshape((5, 2)), range(5)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

Why is it hard-coded to 42?

desertnaut
Saurabh
  • Does this answer your question? [Random state (Pseudo-random number) in Scikit learn](https://stackoverflow.com/questions/28064634/random-state-pseudo-random-number-in-scikit-learn) – Kim Kern Oct 26 '20 at 17:15

5 Answers

Isn't that obvious? 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything.

On a serious note, random_state simply sets the seed of the random number generator, so that your train-test split is deterministic. If you don't set a seed, the split is different each time you run the code.

Relevant documentation:

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
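
To make the three cases in that docstring concrete, here is a minimal sketch that reuses the data from the question. The variable names are mine, and the comparison between the int and the instance assumes the int is simply used to seed a `RandomState`, which is how I read the documentation above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

# int: used as the seed of the random number generator
with_int = train_test_split(X, y, test_size=0.33, random_state=42)

# RandomState instance: the instance itself is used as the generator
# (assumption: seeding it with 42 mirrors what the int case does)
with_instance = train_test_split(
    X, y, test_size=0.33, random_state=np.random.RandomState(42))

# None (the default): NumPy's global generator is used,
# so this split can change from one run of the script to the next
with_none = train_test_split(X, y, test_size=0.33, random_state=None)

# Index 1 of the returned list is X_test
print(np.array_equal(with_int[1], with_instance[1]))
```

Note that a `RandomState` instance is advanced every time it is used, so passing the same instance to two consecutive calls is not the same thing as passing the same int twice.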

cs95
  • That first sentence was more than enough. – Danrex Oct 10 '18 at 02:54
  • @cs95 Do I have to generate a new `random_state` for subsequent methods in my code? For example, if I set the random state as 42 for the `train_test_split`, do I set the random state also as 42 for the classifier I will be using on the split data? What about if I want to compare two different classifiers, do I use the same random state for both classifiers? – Pleastry Oct 27 '20 at 13:19
  • @Turtle I think you are looking to set a global seed so that your whole pipeline is deterministic. Passing `random_state` to `train_test_split` only makes the split deterministic, nothing else. Consider using something like `np.random.seed`, or creating a random state object that is then reused across functions. – cs95 Oct 27 '20 at 18:22
  • But if you use it in `train_test_split`, do you still need to use it when you run each algorithm? – vanetoj Nov 17 '21 at 19:41
  • How is the random_state saved? For example, does it matter if I run my code in different Colab notebooks on different accounts? – Maxl Gemeinderat Jun 02 '22 at 12:38
  • I think the spec dictates that a given seed behaves deterministically across platforms @MaxlGemeinderat, but all bets are off if `random_state` is `None`. – cs95 Jul 23 '22 at 10:25
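
The suggestion in the comments about reusing one random state object could look roughly like the sketch below. The synthetic dataset, the choice of `RandomForestClassifier`, and the names `SEED` and `run_pipeline` are assumptions made purely for illustration; they do not come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical constant, any integer works


def run_pipeline(seed):
    # One generator, created once and handed to every step that needs randomness
    rng = np.random.RandomState(seed)
    X, y = make_classification(n_samples=200, n_features=8, random_state=rng)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=rng)
    clf = RandomForestClassifier(n_estimators=20, random_state=rng)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


# Two complete runs started from the same seed
score_first = run_pipeline(SEED)
score_second = run_pipeline(SEED)
print(score_first == score_second)
```

Because the split and the classifier draw from the same generator, the whole run depends on a single seed instead of on several separate `random_state` arguments.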

If you don't specify random_state, a new random seed is used every time you run your code, so the train and test datasets contain different values on each run.

However, if you assign a fixed value such as random_state = 0, 1, 42, or any other integer, then no matter how many times you execute your code the result is the same, i.e., the train and test datasets contain the same values.
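
A quick way to see this for yourself, using the data from the question (the helper name `held_out_labels` is just something I made up for the sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))


def held_out_labels(seed):
    # Return only y_test so the splits are easy to compare by eye
    _, _, _, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    return y_test


# Fixed seed: every call yields the same held-out labels
fixed_runs = [held_out_labels(42) for _ in range(3)]

# No seed: the held-out labels may differ from call to call
unseeded_runs = [held_out_labels(None) for _ in range(3)]

print(fixed_runs)
print(unseeded_runs)
```

With only five samples the unseeded runs can coincide by chance, so run it a few times before drawing conclusions.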

Farzana Khan

Random state ensures that the splits that you generate are reproducible. Scikit-learn uses random permutations to generate the splits. The random state that you provide is used as the seed for the random number generator, so the same sequence of random numbers, and therefore the same permutation, is produced on every run.
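
As a rough illustration of the permutation idea, here is a sketch in plain NumPy. The function `sketch_split` is hypothetical and is not scikit-learn's actual implementation; it only shows why a fixed seed leads to a fixed split.

```python
import numpy as np

n_samples, n_test = 5, 2  # 5 rows in total, 2 of them held out


def sketch_split(seed):
    # Hypothetical helper: seed a generator, shuffle the row indices,
    # then cut the shuffled order into a test part and a train part
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    return order[n_test:], order[:n_test]  # (train indices, test indices)


train_a, test_a = sketch_split(42)
train_b, test_b = sketch_split(42)

# Same seed, same permutation, same split
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```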

vumaasha

When random_state is not set, the training data changes on every run, so the accuracy may change from run to run as well. When random_state is set to a constant integer, the training data stays the same on every run, which makes debugging easier.
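
To see the effect on accuracy, something like the following sketch can help. The iris dataset and the decision tree are my own choices for the example, not part of the question, and the classifier's seed is held fixed so that only the split varies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def accuracy_for_split(seed):
    # Only the split seed changes; the classifier seed stays at 0
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


# Different split seeds stand in for "different runs without a seed"
scores_by_seed = {seed: accuracy_for_split(seed) for seed in range(5)}
for seed, score in scores_by_seed.items():
    print(f"random_state={seed}: accuracy={score:.3f}")

# Repeating a seed reproduces the score for that seed
repeat_score = accuracy_for_split(0)
print(repeat_score == scores_by_seed[0])
```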

kishore naidu

The random state is like a lot number for the set that an operation generates randomly. Specifying the same lot number again gives you the same set back.