I'm finding this one to be a real head-scratcher. I have a Python 2 notebook that I'm using to do linear regression on a laptop and a desktop. On the laptop, sklearn gives the same results as statsmodels. On the desktop, however, statsmodels gives the correct result, but sklearn gives a wrong one: a number of the coefficient estimates have blown up to values many orders of magnitude larger than they should be, e.g., 304952680 vs. -0.1271. If I then save the notebook, pull it up on my laptop, and run it again, the statsmodels and sklearn linear regression results match. If I reconnect and re-run the notebook from scratch on the desktop, statsmodels is again correct, but the sklearn LinearRegression blows up again. I am mystified. Anyone have any ideas?
Here are the two gists, linked through nbviewer. They are long, but compare, for example, cells 59 and 62 for the variable M12_CS_Months_Since_Last_Gift. In the laptop notebook, statsmodels (cell 59) agrees with sklearn (cell 62); in the desktop notebook they disagree (see the blown-up estimate for that variable in desktop cell 62).

One thing that may be worth noting: the data is characterized by large segments of the predictor space corresponding to the same observed value. Maybe that points to near-collinearity, as was suggested? I'll check the singular values. Additional suggestions, or follow-ups to that one, would be welcome.

Laptop: 64-bit Windows 8.1, statsmodels 0.6.1, sklearn 0.17. Desktop: 64-bit Windows 10, same statsmodels/sklearn versions.
notebook: http://nbviewer.jupyter.org/gist/andersrmr/fb7378f3659b8dd48625
desktop: http://nbviewer.jupyter.org/gist/andersrmr/76e219ad14ea9cb92d9e
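Since the plan is to check singular values, here's the sort of diagnostic I have in mind. A sketch with a made-up design matrix (not my data) containing a nearly duplicated column, to show how a tiny smallest singular value / huge condition number flags near-collinearity:

```python
import numpy as np

# Hypothetical design matrix with one column nearly duplicating another
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X = np.column_stack([X, X[:, 0] + 1e-10 * rng.randn(100)])

# Singular values of the design matrix; a near-zero smallest value
# (equivalently, a huge condition number) means least-squares
# coefficients are numerically unstable and solver-dependent
s = np.linalg.svd(X, compute_uv=False)
cond = s[0] / s[-1]
print(s)
print("condition number: %.3g" % cond)
```

If the condition number on the real design matrix turns out to be huge, that would explain why two different least-squares solvers (statsmodels' pinv-based fit vs. whatever LAPACK routine sklearn ends up calling on each machine) can return wildly different coefficients.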