
When performing a logistic regression with the two APIs, they give different coefficients. Even with this simple example, the results don't match in terms of coefficients. And I followed advice from older answers on the same topic, like setting a large value for the parameter C in sklearn, since it makes the penalization almost vanish (or setting penalty="none").

import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

n = 200

x = np.random.randint(0, 2, size=n)                         # binary predictor
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)   # noisy binary outcome correlated with x

print(pd.crosstab(y, x))  # contingency table of y vs x (display() works only in a notebook)


max_iter = 100

#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)

#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', multi_class='multinomial',
                            max_iter=max_iter, fit_intercept=True, C=1e8)
res_sk.fit(x.reshape(n, 1), y)
print(res_sk.coef_)

For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. And when run multiple times, it always gives different coefficients (sometimes closer than others, but always different).

Thus, even with this toy example the two APIs give different coefficients (and hence odds ratios), and with real data (not shown here) it almost gets "out of control"...

Am I missing something? How can I produce similar coefficients, matching at least to one or two decimal places?

    Looks like there is a similar question: [SO-statmodel-sklearn](https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels) – tianlinhe May 25 '20 at 16:18

1 Answer


There are some issues with your code.

To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:

An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.

It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and my own answer there as well).
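
For reference, sm.add_constant simply prepends a column of ones to the design matrix, and that column plays the role of the intercept term (x_demo below is just a toy array for illustration):

import numpy as np
import statsmodels.api as sm

x_demo = np.array([3., 5., 7.])
print(sm.add_constant(x_demo))
# [[1. 3.]
#  [1. 5.]
#  [1. 7.]]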

The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
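
As a quick sanity check, you can fit both settings side by side and compare the reported coefficients; this is just a sketch, reusing the x and y from your question, with C=1e8 so that the penalty stays negligible (for binary data, the default multi_class='auto' resolves to 'ovr', which is the appropriate formulation here):

from sklearn.linear_model import LogisticRegression

# same x, y as in the question; only multi_class varies
for mc in ('auto', 'multinomial'):
    clf = LogisticRegression(solver='newton-cg', C=1e8, max_iter=100,
                             multi_class=mc)
    clf.fit(x.reshape(-1, 1), y)
    print(mc, clf.intercept_, clf.coef_)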

The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:

There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.

which makes the two models again non-comparable in principle; but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none'; according to the docs:

If ‘none’ (not supported by the liblinear solver), no regularization is applied.

which should now be considered the canonical way to switch off the regularization.
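
To see how much the default penalty actually moves the estimates, here is a minimal side-by-side comparison (again just a sketch, reusing the x and y from the question; only the penalty setting varies):

from sklearn.linear_model import LogisticRegression

for kwargs in ({'C': 1.0},             # sklearn default: L2 penalty with C=1
               {'C': 1e8},             # penalty made negligible
               {'penalty': 'none'}):   # penalty switched off entirely
    clf = LogisticRegression(solver='newton-cg', max_iter=100, **kwargs)
    clf.fit(x.reshape(-1, 1), y)
    print(kwargs, clf.intercept_, clf.coef_)

The last two should agree closely, while the first shows the shrinkage toward zero introduced by the default regularization.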

So, incorporating these changes in your code, we have:

np.random.seed(42) # for reproducibility

#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)

Which gives the result:

Optimization terminated successfully.
         Current function value: 0.403297
         Iterations: 5
         Function evaluations: 6
         Gradient evaluations: 10
         Hessian evaluations: 5
[-1.65822763  3.65065752]

with the first element of the array being the intercept and the second the coefficient of x. For scikit-learn, we have:

#### Scikit-Learn

res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit(x.reshape(n, 1), y)
print(res_sk.intercept_, res_sk.coef_)

with the result being:

[-1.65822806] [[3.65065707]]

These results are practically identical, within the machine's numeric precision.

Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
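
For completeness, here is a small script along the same lines (a sketch; the tolerance in np.allclose is chosen loosely) that repeats the comparison for several seeds:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

n, max_iter = 200, 100

for seed in range(5):
    np.random.seed(seed)
    x = np.random.randint(0, 2, size=n)
    y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)

    # statsmodels: add the intercept explicitly, suppress fitting output
    res_sm = sm.Logit(y, sm.add_constant(x)).fit(method="ncg",
                                                 maxiter=max_iter, disp=0)

    # scikit-learn: regularization switched off
    res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter,
                                penalty='none').fit(x.reshape(n, 1), y)

    sk_params = np.concatenate([res_sk.intercept_, res_sk.coef_.ravel()])
    print(seed, res_sm.params, sk_params,
          np.allclose(res_sm.params, sk_params, atol=1e-4))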

Thanks! I set fit_intercept to True or False but of course still got different coefficients, since the "major" problem was with the multi_class parameter. I thought that setting it to "multinomial" would not change the behavior with binary classification... Ah, my fault! Thanks again! – hellowolrd May 25 '20 at 16:37