I'm performing cross-validation for a classification task. At first I was using scikit-learn's StratifiedKFold. At some point I wanted to run more iterations, so I switched to StratifiedShuffleSplit, and with this new function the results I obtained changed. I eventually realized that if I specify a random_state, I again get results similar to those I was obtaining with StratifiedKFold for the CV.
In summary: if I specify the random_state, different values give slightly different results, all similar to those I was obtaining with StratifiedKFold (with one iteration, or computing the shuffling myself, as here). However, if random_state is None or is not specified, the results change completely.
I checked that when random_state is None, the train and test indices differ between runs and are properly stratified, as expected.
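For reference, this is roughly how I checked it (a toy example; the classes array here is made up, my real data is different):

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

classes = np.array([0]*5 + [1]*5)  # made-up labels, 50/50 classes
sss = StratifiedShuffleSplit(classes, n_iter=3, test_size=0.2)  # random_state=None
for train, test in sss:
    print(test)           # the test indices change from run to run
    print(classes[test])  # but each test fold keeps the class proportions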
I don't have much experience with random number generators, but this doesn't make any sense to me.
Looking at the code, I realized that when random_state is None, the function check_random_state is called. If the seed is None, this function returns the RandomState singleton used by np.random (link).
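You can see this directly with a small check on sklearn.utils.check_random_state:

import numpy as np
from sklearn.utils import check_random_state

print(check_random_state(None) is np.random.mtrand._rand)  # True: the global np.random singleton
print(check_random_state(5))  # a dedicated RandomState seeded with 5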
Here is the problematic piece of code. If I swap the commented line for the one below it, I obtain different results.
import numpy as np
import sklearn as skl
import sklearn.cross_validation  # without this, skl.cross_validation is not available

(...)

# With a fixed seed the results resemble StratifiedKFold:
# skCVs = skl.cross_validation.StratifiedShuffleSplit(classes, n_iter=iterations*kfoldCV, test_size=1/float(kfoldCV), random_state=5)
# Without a seed the results change completely:
skCVs = skl.cross_validation.StratifiedShuffleSplit(classes, n_iter=iterations*kfoldCV, test_size=1/float(kfoldCV))
for train, test in skCVs:
    (classification, ...)
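To illustrate what I mean by reproducible: with a fixed random_state, two independently constructed splitters yield identical folds (again using a made-up classes array, not my real data):

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

classes = np.array([0]*5 + [1]*5)  # made-up labels
a = StratifiedShuffleSplit(classes, n_iter=2, test_size=0.2, random_state=5)
b = StratifiedShuffleSplit(classes, n_iter=2, test_size=0.2, random_state=5)
print([np.array_equal(t1, t2) for (_, t1), (_, t2) in zip(a, b)])  # [True, True]

With random_state=None the same comparison gives [False, False] on almost every run.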
I'm using version 0.14 of sklearn.
Do you have any explanation or clue that could help me understand what is happening?