
I'm finding this one to be a real head-scratcher. I have a Python 2 notebook that I'm using to do linear regression, on both a laptop and a desktop. On the laptop, sklearn gives the same results as statsmodels. On the desktop, however, statsmodels gives the correct result, but sklearn gives a wrong one: a number of the coefficient estimates have blown up to 8 orders of magnitude larger than they should be, e.g., 304952680 vs. -0.1271. If I save the notebook, pull it up on my laptop, and run it again, the statsmodels and sklearn linear regression results are equal. If I re-connect and re-run the notebook from scratch on the desktop, statsmodels is again correct, but the sklearn LinearRegression blows up again. I am mystified. Anyone have any ideas?

Here are the two gists, linked through nbviewer. They are long, but compare, for example, cells 59 and 62, variable M12_CS_Months_Since_Last_Gift. In the laptop notebook, statsmodels (cell 59) agrees with sklearn (cell 62). In the desktop notebook, they disagree (see the blow-up for that variable in desktop cell 62). One thing that may be worth noting: the data is characterized by large segments of the predictor space corresponding to the same observed value. Maybe this points to the near-collinearity suggested in the comments? I'll check the singular values. Additional suggestions or follow-ups to that suggestion would be welcome. Laptop is 64-bit Windows 8.1 / statsmodels v0.6.1 / sklearn 0.17. Desktop is 64-bit Windows 10, same statsmodels/sklearn versions. laptop: http://nbviewer.jupyter.org/gist/andersrmr/fb7378f3659b8dd48625 desktop: http://nbviewer.jupyter.org/gist/andersrmr/76e219ad14ea9cb92d9e

  • Can you make a reproducible example and explain the differences between the two sets of hardware? – Paul H Feb 22 '16 at 23:28
  • Also, how did you install python, statsmodels, and sklearn on each machine? – Paul H Feb 22 '16 at 23:29
  • How would I create a reproducible example of cross-hardware differences? Both machines are Windows Anaconda installs. The laptop is still on Windows 8.1; the desktop is on Windows 10, and its Anaconda install is much more recent. – Richard Anderson Feb 22 '16 at 23:38
  • Well, you're ostensibly doing the linear regression against the same dataset on both machines, so you should include that dataset in the question so people can test it out themselves and see if they get consistent answers. You should also show the commands you use to execute the regressions. Lastly, you should edit the question to include the hardware and installation information, rather than posting it in the comments. Here are some good explanations of how to create a reproducible example: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Paul H Feb 22 '16 at 23:41
  • I agree that code and data (or at least some stats about the data) would be helpful. It would also be good to include what version of sklearn and statsmodels you're using. Also, you say the coefficients are different. Does this difference dramatically affect predicted values, or do the differences balance each other out on your data? My suspicion is that your problem is being caused by the presence of data columns that are nearly linearly dependent. You can check for linear dependence in various ways. A simple one is to compute the singular values. – jcrudy Feb 23 '16 at 07:46

1 Answer


I looked at your notebooks. The performance of your laptop and desktop models on the training set is virtually identical, which means these large coefficient values balance each other out on your training data. So the desktop's result isn't exactly wrong; it just defies the kind of interpretation you might like to attach to it. It also has a larger risk of being overfit (I didn't see whether you scored it on a testing set, but you should). Basically, if you apply the fitted model to an example that violates the collinearity observed in the training set, you'll get ridiculous predictions.

Why is this occurring on one machine and not another? Basically, the coefficients on the set of nearly collinear predictors are numerically unstable, meaning that very small perturbations can lead to large differences. Differences in the underlying numerical libraries, normally invisible to the user, can therefore lead to significant changes in the coefficients. If you think about it in terms of linear algebra, it makes sense why this happens: if two predictors are exactly collinear, a particular linear combination of their coefficients is pinned down by the data, but either coefficient can grow without bound as long as the other balances it out.
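To make that concrete, here's a small sketch with made-up data (not your dataset): with two exactly collinear columns, wildly different coefficient vectors produce identical fitted values, so a solver is free to return huge offsetting coefficients.

```python
import numpy as np

np.random.seed(0)
x1 = np.random.randn(100)
X = np.column_stack([x1, 2.0 * x1])   # second column exactly collinear with the first

# Two very different coefficient vectors...
beta_a = np.array([1.0, 0.0])
beta_b = np.array([-999.0, 500.0])    # -999*x1 + 500*(2*x1) = 1*x1

# ...give exactly the same fitted values
print(np.allclose(np.dot(X, beta_a), np.dot(X, beta_b)))  # True
```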

What is the solution? If there is a real, exact dependence between these variables that will always be present, you can probably ignore the issue; however, I wouldn't, because you never know. Otherwise, either remove the dependent columns manually (which will not hurt prediction), pre-process with an automatic variable selection or dimension reduction technique, or use a regularized regression method such as ridge regression.
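For example (a sketch with made-up near-collinear data; the alpha value would need tuning on your problem), the L2 penalty in ridge regression breaks the tie between near-collinear columns, so neither coefficient can blow up to offset the other:

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(0)
x1 = np.random.randn(200)
x2 = 2.0 * x1 + 1e-8 * np.random.randn(200)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * np.random.randn(200)

# The penalty keeps the coefficients bounded even under near-collinearity
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)   # modest magnitudes, roughly [0.2, 0.4], not huge offsetting values
```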

Note: It's possible I'm wrong in my assumptions here. It would be good to validate the suspected collinearity by computing the singular values of your design matrix. If you do so, please comment.
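A quick way to do that check (a sketch; substitute your actual design matrix for the made-up X here):

```python
import numpy as np

# Stand-in design matrix with one exactly dependent column
np.random.seed(0)
x1 = np.random.randn(50)
X = np.column_stack([x1, 2.0 * x1, np.random.randn(50)])

s = np.linalg.svd(X, compute_uv=False)
print(s)
# A tiny smallest singular value, and hence a huge ratio of largest to
# smallest (the condition number), indicates (near-)linear dependence
print(s[0] / s[-1])
```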

Second note: There are least-squares solvers that will automatically zero out dependent columns. If you look at scipy.linalg.lstsq, you can pass a cutoff argument (cond) to zero out small singular values. Also, as you've seen, some solvers are more stable than others, so you can always just use the more stable solver.
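A sketch of that on made-up rank-deficient data: with a cond cutoff, scipy.linalg.lstsq reports the effective rank and returns the small, minimum-norm solution instead of exploding coefficients.

```python
import numpy as np
from scipy.linalg import lstsq

np.random.seed(0)
x1 = np.random.randn(100)
X = np.column_stack([x1, 2.0 * x1])   # rank-deficient: 2 columns, rank 1
y = x1.copy()

# cond is the relative cutoff below which singular values are treated as zero
beta, residues, rank, sv = lstsq(X, y, cond=1e-10)
print(rank)   # 1, not 2
print(beta)   # minimum-norm solution, approximately [0.2, 0.4]
```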

jcrudy
  • It's a singular design matrix. The statsmodels summary shows a condition number of 1.13e+16, which means the matrix is essentially singular. By default statsmodels uses `pinv`, which applies a singular-value cutoff in the SVD at the numpy default, which is very small, less than 1e-15 IIRC. (The print version of the summary would show a warning text, but it's not included in the html version.) – Josef Feb 26 '16 at 06:22