
I'm trying to write a unit test for some of my code that uses scikit-learn. However, my unit tests seem to be non-deterministic.

AFAIK, the only places in my code where scikit-learn uses any randomness are in its LogisticRegression model and its train_test_split, so I have the following:

RANDOM_SEED = 5
self.lr = LogisticRegression(random_state=RANDOM_SEED)
X_train, X_test, y_train, test_labels = train_test_split(
    docs, labels, test_size=TEST_SET_PROPORTION, random_state=RANDOM_SEED)

But this doesn't seem to work: even when I pass in a fixed `docs` and a fixed `labels`, the prediction probabilities on a fixed validation set vary from run to run.

I also tried adding a numpy.random.seed(RANDOM_SEED) call at the top of my code, but that didn't seem to work either.

Is there anything I'm missing? Is there a way to pass a seed to scikit-learn in a single place, so that seed is used throughout all of scikit-learn's invocations?

John
    It's very likely that there is something else wrong in your code! Using a seed in LR and Splitting will be enough to make sure it's behaving deterministically! – sascha Nov 22 '16 at 19:49
  • 2
    I'm not sure if it will solve your determinism problem, but this isn't the right way to use a fixed seed with `scikit-learn`. Instantiate a `prng=numpy.random.RandomState(RANDOM_SEED)` instance, then pass that as `random_state=prng` to each individual function. If you just pass `RANDOM_SEED`, each individual function will restart and give the same numbers in different places, causing bad correlations. – Robert Kern Nov 22 '16 at 21:01
  • @RobertKern Can you elaborate? I don't quite understand what you are trying to explain. But of course using an int-seed is a valid approach of making these functions deterministic. Maybe you are talking about problems with distributed-seeding but even if so, i can't understand where that is coming from and there also much better approaches then. – sascha Nov 22 '16 at 21:07
  • Determinism isn't the only important thing. Statistical independence is also important, and you don't get that by passing the same integer seed to multiple `scikit-learn` functions in the same pipeline. You want exactly one `RandomState` instance to be shared by all functions in the pipeline. – Robert Kern Nov 22 '16 at 21:33
  • @RobertKern That depends on the environment / task (and of course the PRNG), but is not applying to the OP's problem here. – sascha Nov 22 '16 at 21:37
  • No, it's just the next problem the OP will encounter after he solves the determinacy (probably by finding another omitted part of the pipeline that takes a `random_state=` argument). That's why I put this in a comment. I will state categorically that the way I mentioned is the one correct way to use a specified seed for the OP's environment/task and PRNG. – Robert Kern Nov 22 '16 at 23:24
  • If you want more help, you could post more code. – serv-inc Jun 29 '17 at 19:19
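
A minimal sketch of the shared-RandomState approach Robert Kern describes above (the iris data is used purely as stand-in input, since the question's `docs` and `labels` aren't shown):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

RANDOM_SEED = 5

# One RandomState instance shared by every randomized step of the pipeline,
# instead of re-seeding each function with the same integer.
prng = np.random.RandomState(RANDOM_SEED)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=prng)
lr = LogisticRegression(random_state=prng).fit(X_train, y_train)
print(lr.predict_proba(X_test)[:3])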

1 Answer

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

RANDOM_SEED = 5

iris = datasets.load_iris()
X, y = iris.data, iris.target

lr = linear_model.LogisticRegression(random_state=RANDOM_SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_SEED)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

This produced 0.93333333333333335 several times in a row, so the way you did it seems OK. Another option is to set np.random.seed() globally, or to use Sacred for documented randomness. Passing random_state is what the scikit-learn developer docs describe:

If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
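
A rough illustration of that pattern (the `jitter` helper here is a made-up example, not a scikit-learn function; `sklearn.utils.check_random_state` is the utility that turns a `random_state` argument into a `RandomState` object):

import numpy as np
from sklearn.utils import check_random_state

# Made-up example function following the documented random_state pattern.
def jitter(X, scale=0.01, random_state=None):
    # check_random_state accepts None, an int seed, or an existing
    # RandomState instance and returns a RandomState object.
    rng = check_random_state(random_state)
    return X + scale * rng.normal(size=np.shape(X))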

serv-inc