
The libraries statsmodels and sklearn produce different values of the log-loss function. A toy example:

import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import log_loss

df = pd.DataFrame(
    columns=['y','x1','x2'],
    data=[
        [1,3,5],
        [1,-2,7],
        [0,-1,-5],
        [0,2,3],
        [0,3,5],
    ])

logit = sm.Logit(df.y,df.drop(columns=['y']))

res = logit.fit()

The result of res.llf is -1.386294361119906, while the result of -log_loss(df.y,res.fittedvalues) is -6.907755278982137. Shouldn't they be equal (up to a small difference due to different numerical implementations)? The statsmodels documentation says that .llf is the log likelihood of the model and as this question and this Kaggle post point out, log_loss is just the negative of the log likelihood.

Package versions: scikit-learn==1.0.1, statsmodels==0.13.5

dwolfeu

1 Answer


As you can see, res.fittedvalues returns some negative values, so it cannot contain probabilities. If you want the predicted probabilities for your observations, you should use res.predict() instead, which returns values between 0 and 1.
You can calculate the log-loss in the following ways:
1. Using sklearn log_loss:

log_loss(df.y, res.predict())
--> 0.27725887222398127

2. Using statsmodels:

res.mle_retvals['fopt']
--> 0.27725887222398116
# or
res.llf / res.nobs
--> -0.27725887222398116

The very small difference is due to floating-point rounding.

Note: In order to get the predicted values from res.fittedvalues you need to apply the expit function (inverse of logit):

from scipy.special import expit

expit(res.fittedvalues)

This returns the same predictions as res.predict().

Mattravel
    great answer. Ignore `fittedvalues` in discrete models. Those are the linear predictors and not an expected conditional mean as in all other models :( – Josef Mar 06 '23 at 04:07
  • Yep, great answer! What does `res.llf` return though? Shouldn't this also agree with `log_loss(df.y, res.predict())`? – dwolfeu Mar 06 '23 at 04:46
  • 2
    `llf` is the sum of loglike over observation, `res.mle_retvals['fopt']` is minus loglike / nobs. AFAICS, sklearn `log_loss` defaults to averaging not sum. – Josef Mar 06 '23 at 17:09
  • Thanks @Josef, I couldn't reconcile `llf`. I have updated the answer with that information. – Mattravel Mar 07 '23 at 00:00