I'm pretty sure it's a feature, not a bug, but I would like to know if there is a way to make sklearn and statsmodels match in their logit estimates. A very simple example:
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

np.random.seed(123)
n = 100
y = np.random.randint(0, 2, n)  # binary outcome in {0, 1}
x = np.random.random((n, 2))
# Constant term
x[:, 0] = 1.
The estimates with statsmodels:
sm_lgt = sm.Logit(y, x).fit()
Optimization terminated successfully.
Current function value: 0.675320
Iterations 4
print(sm_lgt.params)
[ 0.38442 -1.1429183]
And the estimates with sklearn:
sk_lgt = LogisticRegression(fit_intercept=False).fit(x, y)
print(sk_lgt.coef_)
[[ 0.16546794 -0.72637982]]
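To make the gap concrete, here is a quick check that evaluates the unpenalized average negative log-likelihood at each coefficient vector (neg_loglik is my own helper, not part of either library); if statsmodels is solving the plain maximum-likelihood problem, its estimates should score lower:

def neg_loglik(beta, x, y):
    # Average negative log-likelihood of the logit model, no penalty:
    # mean of log(1 + exp(x.beta)) - y * (x.beta) over observations
    z = np.dot(x, beta)
    return np.mean(np.log1p(np.exp(z)) - y * z)

print(neg_loglik(sm_lgt.params, x, y))        # should be near the 0.675320 above
print(neg_loglik(sk_lgt.coef_.ravel(), x, y))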
I think it has to do with the implementation in sklearn, which applies some sort of regularization by default. Is there an option to estimate a bare-bones logit as in statsmodels? sklearn is substantially faster and scales much more nicely, so I'd like to keep using it. Also, does sklearn provide inference (standard errors) or marginal effects?
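If the default L2 penalty is indeed the cause, I'd expect weakening it to close the gap. Here is a minimal sketch of what I mean, assuming C is the inverse regularization strength as the sklearn docs describe:

# A huge C should make the penalty negligible and approximate an
# unpenalized maximum-likelihood fit.
sk_unreg = LogisticRegression(C=1e9, fit_intercept=False).fit(x, y)
print(sk_unreg.coef_)  # I'd expect this to land close to sm_lgt.params

If that works, the remaining question is whether there is a cleaner way to switch the penalty off entirely.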