
How to retrieve the random state of sklearn.model_selection.train_test_split?

Without setting the random_state, I split my dataset with train_test_split. Because the machine learning model trained on the split dataset performs quite well, I want to retrieve the random_state that was used to split the dataset. Is there something like numpy.random.get_state() for this?

meTchaikovsky

3 Answers


If you trace through the call stack of train_test_split, you'll find that the random_state parameter is used like this:

from sklearn.utils import check_random_state

# inside sklearn's own code, self is the splitter instance and
# self.random_state is whatever you passed as random_state (None by default)
rng = check_random_state(self.random_state)
print(rng)

The relevant part of check_random_state is

def check_random_state(seed):
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    # ... (an int seed returns a new np.random.RandomState(seed); an
    # existing RandomState instance is returned unchanged)

If random_state=None, you get the default numpy.random.RandomState singleton, which you can use to generate new random numbers, e.g.:

print(rng.permutation(10))
print(rng.randn(10))
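A minimal sketch of what that means in practice (my own illustration, assuming only numpy and scikit-learn):

import numpy as np
from sklearn.utils import check_random_state

# with random_state=None you get back the shared global numpy RNG,
# not a freshly seeded generator, so no per-split seed is ever stored
rng = check_random_state(None)
print(rng is np.random.mtrand._rand)  # True

# numpy.random.get_state() reads that same singleton, but only a snapshot
# taken *before* calling train_test_split would let you reproduce the split
state_before = np.random.get_state()

In practice this means that, unless you saved np.random.get_state() before calling train_test_split, there is no stored seed to retrieve afterwards.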


mbatchkarov
    So how would I use it in this situation: `kf = KFold(n_splits = 10, shuffle = True, random_state = None) rng = check_random_state(self.random_state) ` Because this gives me the following error: `NameError: name 'self' is not defined` – Leo Nov 12 '21 at 14:29
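Outside sklearn's own code there is no self, so a minimal sketch of the same idea would pass the splitter's random_state attribute instead (this is an illustration, not from the answer's code trace):

from sklearn.model_selection import KFold
from sklearn.utils import check_random_state

kf = KFold(n_splits=10, shuffle=True, random_state=None)
# kf.random_state is just the value you passed (here None), so this
# again returns the global numpy RandomState singleton
rng = check_random_state(kf.random_state)
print(rng)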

What do you mean?

If you want to know which random_state you are using, you have to set random_state explicitly when calling the function, for example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
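A quick sketch (with toy data of my own) of why that matters: the same explicit seed always reproduces the same split, which is the only reliable way to know it afterwards.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape(10, 2), np.arange(10)  # toy data
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))  # True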

By default it is set to None; see the docs.

Here is also further information on random_state.

Or do you mean this?

PV8
  • Thank you for your answer. What I mean is that I split my dataset with `train_test_split` without setting the `random_state`; can I retrieve the `random_state` that was used by `train_test_split` afterwards? – meTchaikovsky Oct 27 '20 at 06:34

If you only have an old notebook showing a slice of one or more of the train/test subsets (e.g. X_test[0:5], y_train[-5:], etc.), but you know the other parameters of the train_test_split() call (e.g. test_size or train_size, shuffle, stratify) and can perfectly recreate X and y, you could try brute-forcing it: generate new splits with different random_state seeds, compare each split to your known subset slice, and record any random_state values that produce matching slice values (or values close enough that the differences could just be floating-point noise).

import numpy as np
from sklearn.model_selection import train_test_split

# slice of y_train recovered from the old notebook (example values)
target_y_train = np.array([-5.482, -11.165, -13.926, -7.534, -8.323])

possible_random_state_values = []
for i in range(0, 1000):
    # recreate the split with each candidate seed and compare the known slice
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    if all(np.isclose(y_train[0:5], target_y_train)):
        possible_random_state_values.append(i)
        print(f"Possible random state value found: {i}")

If you don't get any possible seeds from the [0, 1000) range, increase the upper bound. And when you do get candidate values, you can plug them into train_test_split(), compare other subset slices if you have any, rerun your model training pipeline, and compare your output metrics.
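A minimal sketch of that verification step, assuming the loop above found at least one candidate:

candidate = possible_random_state_values[0]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=candidate)
# compare every slice you still have; if they all match, rerun the
# training pipeline with this seed and check the output metrics as well
print(all(np.isclose(y_train[0:5], target_y_train)))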

MattTriano