If you only have an old notebook showing a slice of one or more of the train/test subsets (e.g. X_test[0:5], y_train[-5:], etc.), but you know the other parameters of the train_test_split() call (e.g. test_size or train_size, shuffle, stratify) and can perfectly recreate X and y, you can try brute-forcing it: generate new splits with different random_state seeds, compare each split against your known subset slice, and record any random_state values that produce matching values (or values close enough that the differences could just be floating-point noise).
import numpy as np
from sklearn.model_selection import train_test_split

# The known slice copied from the old notebook
target_y_train = np.array([-5.482, -11.165, -13.926, -7.534, -8.323])

possible_random_state_values = []
for i in range(0, 1000):
    # Recreate the split with the known parameters, varying only random_state
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    # np.isclose tolerates small floating-point differences
    if all(np.isclose(y_train[0:5], target_y_train)):
        possible_random_state_values.append(i)
        print(f"Possible random state value found: {i}")
If you don't get any possible seeds from the first 1,000 values (range(0, 1000)), increase the upper bound. Once you have candidate values, you can plug them into train_test_split(), compare other subset slices if you have any, rerun your model training pipeline, and compare your output metrics.
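For example, a verification pass over the candidates might look like the minimal sketch below. It assumes you also kept an X_test slice and the same test_size=0.3 as above; verify_seed and known_X_test_head are made-up names for illustration, not anything from the original notebook.

import numpy as np
from sklearn.model_selection import train_test_split

def verify_seed(seed, X, y, known_X_test_head, test_size=0.3):
    # Recreate the split with a candidate seed and compare another known slice
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    # allclose tolerates small floating-point differences
    return np.allclose(X_test[:len(known_X_test_head)], known_X_test_head)

# Keep only the candidate seeds that also reproduce the other known slice:
# confirmed = [s for s in possible_random_state_values if verify_seed(s, X, y, known_X_test_head)]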