
I'm trying to write an integration test that uses the descriptive statistics (.describe().to_list()) of the results of a model prediction (model.predict(X)). However, even though I've set np.random.seed(###), the descriptive statistics are different when I run the tests in the console versus in the environment created by PyCharm:

Here's an MRE to run locally:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd

np.random.seed(42)

X, y = make_regression(n_features=2, random_state=42)
regr = ElasticNet(random_state=42)
regr.fit(X, y)

pred = regr.predict(X)

# Theory: this result should be the same as the result produced inside a test class
pd.Series(pred).describe().to_list()

And an example test file:

from unittest import TestCase
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd

np.random.seed(42)

class TestPD(TestCase):
    def testExpectedPrediction(self):
        np.random.seed(42)
        X, y = make_regression(n_features=2, random_state=42)
        regr = ElasticNet(random_state=42)
        regr.fit(X, y)

        pred = pd.Series(regr.predict(X))

        for i in pred.describe().to_list():
            print(i)

        # here we would have a self.assertTrue/assertEqual for each element (see the sketch below)
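
To make that placeholder comment concrete, the kind of assertion I have in mind is sketched below. The expected values and the class name are placeholders only (the real values are exactly what differs between the two runs), and np.testing.assert_allclose is just one way to compare with a tolerance rather than asserting exact equality:

from unittest import TestCase

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet


class TestPDSketch(TestCase):
    def testExpectedPrediction(self):
        X, y = make_regression(n_features=2, random_state=42)
        regr = ElasticNet(random_state=42)
        regr.fit(X, y)

        pred = pd.Series(regr.predict(X))

        # Placeholder expected values in describe() order:
        # count, mean, std, min, 25%, 50%, 75%, max -- NOT real output.
        expected = [100.0, 0.0, 50.0, -150.0, -30.0, 0.0, 30.0, 150.0]

        # Tolerance-based comparison; raises AssertionError on mismatch,
        # which unittest reports as a test failure.
        np.testing.assert_allclose(pred.describe().to_list(), expected, rtol=1e-6)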

What appears to happen is that when I run this test in the Python Console, I get one result, but when I run it using PyCharm's unittest runner for the folder, I get another. Importantly, PyCharm uses the project interpreter to create the console environment, which ought to be the same as the test environment. This leads me to believe that I'm missing something about the way random_state is passed along. My expectation is that, given I have set a seed, the results would be reproducible. That doesn't appear to be the case, and I would like to understand:

  1. Why aren't they equal?
  2. What can I do to make them equal?

I haven't been able to find much in the way of best practices for testing against expected model results, so commentary in that regard would also be helpful.
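
For completeness, here is the kind of minimal check I can run once in the Python Console and once from the PyCharm test run to confirm that the interpreter and library versions really do match between the two; nothing in it is specific to my project:

import sys

import numpy as np
import pandas as pd
import sklearn

# Run this in both contexts; a mismatch in interpreter or library versions
# would be a simpler explanation than the handling of random_state.
print("python :", sys.version)
print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)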

Brandon Bertelsen
  • Well, you can check if the dependencies / libraries are exactly the same in both the console and the PyCharm environment – Vivek Kumar Feb 25 '19 at 07:51
  • Thank you Vivek. In PyCharm you can set a Python interpreter for the project. In this case I'm using docker-compose without rebuilds, so it's guaranteed that the test environment's dependencies are the same as the console's, given that the same interpreter is being used and the docker containers are not being rebuilt. – Brandon Bertelsen Feb 25 '19 at 13:54
  • scikit-learn algorithms seem to be based on the numpy random generator https://stackoverflow.com/a/31058798/4762738. Does also setting the Python built-in random seed (`import random; random.seed(42)`) change anything? – Eskapp Feb 25 '19 at 14:43
  • Thank you Eskapp, I've tried this variation as well and it seems to have no effect. – Brandon Bertelsen Feb 25 '19 at 15:15

0 Answers