
I'm performing cross-validation to evaluate a classifier. At first I was using StratifiedKFold from scikit-learn. At some point I wanted to run more iterations, so I switched to StratifiedShuffleSplit. With this new class the results I was obtaining changed. Finally, I realized that if I specify a random_state I again get results similar to those I was obtaining when using StratifiedKFold for the CV.

In summary, if I specify random_state, different values give slightly different results, all similar to those I was obtaining with StratifiedKFold (with one iteration, or computing the shuffling myself, as here). However, if random_state is None or is not specified, the results I obtain change completely.

I checked that when random_state is None, the train and test indices are different and stratified, as expected.

I don't have experience with random number generators, but this does not make any sense to me.

Looking at the code, I realized that when random_state is None the function check_random_state is called. If seed is None, this function returns the RandomState singleton used by np.random (link).
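
For reference, this is easy to verify (a quick sketch assuming scikit-learn 0.14, where check_random_state is available from sklearn.utils):

import numpy as np
from sklearn.utils import check_random_state

# With seed=None, check_random_state returns the module-level RandomState
# singleton that np.random itself uses.
print(check_random_state(None) is np.random.mtrand._rand)  # True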

Here is the problematic piece of code. If I swap the commented line for the one below it, I obtain different results.

import numpy as np
import sklearn.cross_validation  # needed so that skl.cross_validation is accessible
import sklearn as skl

(...)

# With a fixed random_state the results are similar to StratifiedKFold:
# skCVs = skl.cross_validation.StratifiedShuffleSplit(classes, n_iter=iterations * kfoldCV,
#                                                     test_size=1 / float(kfoldCV), random_state=5)
skCVs = skl.cross_validation.StratifiedShuffleSplit(classes, n_iter=iterations * kfoldCV,
                                                    test_size=1 / float(kfoldCV))

for train, test in skCVs:
    (classification, ...)

I'm using scikit-learn version 0.14.

Do you have any explanation or clue that could help me understand what is happening?

Argitzen

2 Answers


(Stratified)ShuffleSplit shuffles the data at random prior to splitting. The (pseudo-)randomness is controlled by the random_state constructor parameter. With the default value of None, each new call yields a different shuffling. To get a deterministic shuffling, pass an integer seed.
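
As a minimal sketch (assuming scikit-learn 0.14, as in the question, and a made-up toy classes array): an integer random_state makes the generated splits reproducible, while random_state=None draws a fresh shuffling from the global np.random state every time.

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

classes = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # toy labels

# Same integer seed -> the two iterators generate identical splits.
cv_a = StratifiedShuffleSplit(classes, n_iter=3, test_size=0.2, random_state=5)
cv_b = StratifiedShuffleSplit(classes, n_iter=3, test_size=0.2, random_state=5)
print([np.array_equal(tr_a, tr_b)
       for (tr_a, _), (tr_b, _) in zip(cv_a, cv_b)])  # [True, True, True]

# random_state=None -> the shared np.random state is used, so two such
# iterators (and two runs of the script) will almost surely give different splits.
cv_c = StratifiedShuffleSplit(classes, n_iter=3, test_size=0.2)
cv_d = StratifiedShuffleSplit(classes, n_iter=3, test_size=0.2)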

ogrisel
  • Thanks. I see your point, but then I expect to get similar results when I specify different integer seeds, or when I don't pass any, right? The problem is that I get some results for different manually specified seeds, but different outcomes when I leave random_state as None. It's as if my results change depending on the random number generator, and that does not make any sense to me. – Argitzen Apr 04 '14 at 09:43
  • What do you mean by your "results change"? To me it sounds like the expected behavior. If the results were not impacted by randomly shuffled cross-validation, why would you do cross-validation in the first place? If you want a better estimate of the mean validation score (a narrower estimate of the standard error of the mean), just increase the number of iterations. – ogrisel Apr 05 '14 at 19:55
  • With "results change" I mean expected behavior. If I run a 10fold CV, the only thing I can do to increase the amount of data without shuffling is to use a 15fold or a 20fold CV. That's why I switched to the shuffling version, to be able to increase the number of iterations as I need. My problem is that I get different behavior when I run the classification with StratifiedShuffleSplit specifying a random_state (any integer) or if it is None. I expect the same results when the seed is given manually or if it is taken from some random number generator from np.random. – Argitzen Apr 07 '14 at 15:13
  • The mean validation score is significantly higher with shuffling? This is a typical symptom of a breakage of the i.i.d. assumption. When you shuffle the data you can significantly overfit on non-iid data. This happens when you have temporal dependencies, or when your measurements come from different subjects / experiments / sessions grouped in consecutive samples. You need a specific scheme that splits on group boundaries, so that samples from a given subject are either all in the train fold or all in the validation fold (a sketch of such a scheme follows these comments). – ogrisel Jul 22 '14 at 09:12
  • That could make sense. There could be some temporal dependencies in my data (indeed, it's very likely), and therefore shuffling could overfit due to the non-iid data. Thanks! – Argitzen Jul 23 '14 at 09:54
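
As a minimal sketch of the group-wise scheme described in the comments above (just an illustration, assuming scikit-learn 0.14 and a made-up subject_labels array): LeaveOneLabelOut keeps all samples sharing a label on the same side of the split.

import numpy as np
from sklearn.cross_validation import LeaveOneLabelOut

# Hypothetical labels: one entry per sample, identifying the subject / session
# it comes from (made up for illustration).
subject_labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

for train, test in LeaveOneLabelOut(subject_labels):
    # Each iteration puts all samples of exactly one subject in the test
    # fold and the remaining subjects in the train fold.
    print(subject_labels[train], subject_labels[test])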

I am also no expert on random generators, but from what I can understand, a different RandomState object is used if you do not define the random_state. Here is the explanation I found:

"If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise."[1]

"If size is an integer, then a 1-D array filled with generated values is returned. " [1]

You can see the code of the two different random generators being called in "check_random_state" here [2].
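
As a small sketch of the two branches (assuming scikit-learn 0.14 and its sklearn.utils.check_random_state): an integer seed gives a fresh, reproducible RandomState, while None falls back to the shared np.random singleton, whose state depends on everything else that has already drawn from it.

import numpy as np
from sklearn.utils import check_random_state

# Integer seed: a new RandomState, reproducible across calls.
print(check_random_state(5).rand(3))
print(check_random_state(5).rand(3))   # same triple as above

# None: the shared np.random singleton, so consecutive calls continue its
# global stream and generally give different numbers.
print(check_random_state(None).rand(3))
print(check_random_state(None).rand(3))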

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html

[2] https://github.com/scikit-learn/scikit-learn/blob/0.14.X/sklearn/utils/validation.py

Does this help you?

pequetrefe
  • Thanks. I see I'm using different random generators depending on whether I specify random_state or not. The problem is that I don't understand why I get different results for different random number generators. – Argitzen Apr 03 '14 at 16:43
  • You're welcome, Argitzen. You will see that with a None-seeded generator a different seed is taken each time, but if you set up a seed for the generator, then the same "randomness" appears every time because it comes from the same seed. Maybe you should check this post: http://stackoverflow.com/questions/9023660/how-to-generate-a-repeatable-random-number-sequence – pequetrefe Apr 03 '14 at 17:52
  • I agree. But then it should be equivalent to run the code many times with different manually specified seeds, or without passing any seed. The only difference would be that a different random number generator is used, and that should not change the results, but it does... I don't know if I explained the problem properly. – Argitzen Apr 04 '14 at 09:46