Based on this answer: Random state (Pseudo-random number) in Scikit learn, if I use the same integer (say 42) as `random_state`, then each time it does a train-test split it should give the same split (i.e. the same data instances in train on each run, and likewise for test).
But,
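    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    for test_size in test_sizes:
        # Same seed on every iteration; only test_size changes
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        clf = SVC(C=penalty, probability=False)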
Suppose I have code like this. In this case, I am changing `test_size` on each loop iteration. How will that affect what `random_state` does? Will it shuffle everything, or keep as many rows intact as possible and shift a few rows from train to test (or vice versa) according to the test size?
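One way to probe this empirically is to split a range of row indices instead of real data and compare which indices land in the test set for each size. This is only a minimal sketch, not my actual code: the helper `split_test_indices`, the sample count of 100, and the list of test sizes are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_test_indices(test_size, n_samples=100, seed=42):
    """Return the set of row indices that end up in the test split."""
    indices = np.arange(n_samples)
    _, test_idx = train_test_split(indices, test_size=test_size, random_state=seed)
    return set(test_idx.tolist())


test_sizes = [0.1, 0.2, 0.3]  # hypothetical values for illustration
splits = {ts: split_test_indices(ts) for ts in test_sizes}

# Compare each test set with the next larger one
for small, large in zip(test_sizes, test_sizes[1:]):
    shared = splits[small] & splits[large]
    print(f"test_size {small} vs {large}: "
          f"{len(shared)} of {len(splits[small])} test rows shared")
```

If the shared count equals the size of the smaller test set, rows are only being moved between train and test; if it is much lower, everything is being reshuffled.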
`random_state` is a parameter for some classifiers, like `sklearn.svm.SVC` and `sklearn.tree.DecisionTreeClassifier`. I have code like this:

    from sklearn import tree
    from sklearn.model_selection import cross_validate

    clf = tree.DecisionTreeClassifier(random_state=0)
    scores = cross_validate(clf, X_train, y_train, cv=cv)
    cross_val_test_score = round(scores['test_score'].mean(), prec)
    clf.fit(X_train, y_train)

What exactly does `random_state` do here? It is set when the classifier is defined, before the classifier has been given any data. I got the following from http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
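To make the three quoted cases concrete, here is a small sketch using `sklearn.utils.check_random_state`, which turns a `random_state` argument into a `RandomState` object. I am assuming this helper reflects how estimators treat the parameter; the variable names are mine.

```python
import numpy as np
from sklearn.utils import check_random_state

# Case 1: an int is used as a seed, so two generators built from it agree
rng_a = check_random_state(0)
rng_b = check_random_state(0)
same_draws = np.array_equal(rng_a.randint(0, 100, size=5),
                            rng_b.randint(0, 100, size=5))
print("int seed gives identical draws:", same_draws)

# Case 2: a RandomState instance is returned unchanged and used directly
rng = np.random.RandomState(0)
print("instance is passed through:", check_random_state(rng) is rng)

# Case 3: None falls back to NumPy's global RandomState, so draws vary
# between runs unless np.random.seed(...) was called beforehand
global_rng = check_random_state(None)
print("None gives a RandomState:", isinstance(global_rng, np.random.RandomState))
```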
Suppose the following line is executed multiple times for each of multiple test-sizes:
    clf = tree.DecisionTreeClassifier(random_state=0)
If I set `random_state=int(test_size*100)`, does that mean that for each test size the results will come out the same (and that for different test sizes they will be different)? (Here, `tree.DecisionTreeClassifier` could be replaced with other classifiers that also use `random_state`, such as `sklearn.svm.SVC`. I assume all classifiers use `random_state` in a similar way?)
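A reproducibility check along these lines could be sketched as follows. The synthetic data from `make_classification`, the helper `fit_and_predict`, and the particular test sizes are assumptions made for illustration; only the seeding scheme `int(test_size*100)` comes from my question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real X, y
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_and_predict(test_size):
    """Split with a fixed seed, fit a tree seeded from test_size, predict."""
    seed = int(test_size * 100)
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=test_size, random_state=42)
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


for test_size in [0.1, 0.2, 0.3]:
    first = fit_and_predict(test_size)
    second = fit_and_predict(test_size)
    print(test_size, "identical predictions across runs:",
          np.array_equal(first, second))
```

This only compares repeated runs at the same test size. Results for different test sizes would differ anyway, because the training data changes along with the seed.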