
This is probably a simple question, but I am trying to calculate the p-values for my features, using either classifiers for a classification problem or regressors for regression. Could someone suggest the best method for each case and provide sample code? I just want to see the p-value for each feature rather than keep the k best / percentile of features etc., as explained in the documentation.

Thank you

user1096808

3 Answers


You can use statsmodels

import statsmodels.api as sm

# Logit reports a per-coefficient p-value in the P>|z| column of the summary
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary())

The results would be something like this

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:               406723
Model:                          Logit   Df Residuals:                   406710
Method:                           MLE   Df Model:                           12
Date:                Fri, 12 Apr 2019   Pseudo R-squ.:                0.001661
Time:                        16:48:45   Log-Likelihood:            -2.8145e+05
converged:                      False   LL-Null:                   -2.8192e+05
                                        LLR p-value:                8.758e-193
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0037      0.003     -1.078      0.281      -0.010       0.003
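
For the regression half of the question, a minimal sketch along the same lines (assuming a continuous `y_train`) would swap `Logit` for `OLS`:

import statsmodels.api as sm

# OLS reports a per-coefficient p-value in the P>|t| column of the summary
ols_model = sm.OLS(y_train, X_train)
result = ols_model.fit()
print(result.summary())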
LinNotFound
  • I agree. StatsModels is developed by statisticians, so you will get more statistical information. Sklearn is developed by developers, so it is easier to use and to integrate into a pipeline. Choose your tool carefully according to your objective. – el Josso Apr 28 '21 at 14:38
  • So much simpler! Thanks! I had gone about this the long way using `sklearn.` – Alain Jan 19 '22 at 21:58
  • This is a great answer, but it is worth noting that `sm.Logit` will not automatically add an intercept term, whereas `sklearn.LogisticRegression` will. Therefore, I recommend changing the code to `logit_model=sm.Logit(y_train,sm.add_constant(X_train))` to manually add the intercept term. – Steve Walsh Jan 20 '23 at 16:47
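
Following the last comment above, a minimal sketch of the same fit with the intercept added explicitly (same `X_train`/`y_train` as in the answer):

import statsmodels.api as sm

# add_constant prepends a column of ones, so the model gets an intercept
# term, matching what sklearn.LogisticRegression fits by default
logit_model = sm.Logit(y_train, sm.add_constant(X_train))
result = logit_model.fit()
print(result.summary())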

Just run the significance test on X, y directly. Example using 20news and chi2:

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.feature_selection import chi2
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data.data, data.target
>>> scores, pvalues = chi2(X, y)
>>> pvalues
array([  4.10171798e-17,   4.34003018e-01,   9.99999996e-01, ...,
         9.99999995e-01,   9.99999869e-01,   9.99981414e-01])
Fred Foo
  • Looks good. And how can I bring all these numbers to a 0.0000 form? (very noob, sorry) – user1096808 Mar 11 '14 at 16:48
  • I used `scores, pvalues = chi2(traindata, targetdata)` then `pvalues = ["{0:.7f}".format(x) for x in pvalues]` and `print pvalues`. Is this the right way? Thanks. – user1096808 Mar 11 '14 at 17:16
  • @user1096808 Number formatting is covered by the Python tutorial, please read that. – Fred Foo Mar 12 '14 at 11:45
  • I'm getting "Input X must be non-negative." specifically for the chi2 test. Does this only work with variables that have no negative values? How do you get a p-value for features which aren't necessarily always positive? – Alexis Eggermont Aug 25 '15 at 08:53
  • The OP seems to want the p-values for each feature in a regression as returned by `statsmodels`. The p-values in this answer are NOT those p-values. These are univariate chi-squared tests, meaning that each feature is tested independently, not in a common model. – Adam Nov 25 '18 at 12:53
  • `scipy.stats.linregress`, while not specific to how to calculate them, does provide the 'correct' p-values: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html – DaveRGP Apr 09 '19 at 14:04
  • @AlexisEggermont chi2 only works for non-negative numbers; use f_regression instead if you have negative values. However, both chi2 and f_regression are univariate tests: they only consider one variable against the response variable at a time. If you want the p-value for each coefficient in the full model, use the statsmodels model. See this post: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression/34983005#34983005 – music_piano Jan 04 '20 at 16:42
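
To make the distinction drawn in the comments above concrete, here is a small sketch on hypothetical data comparing the univariate `f_regression` p-values with the per-coefficient p-values from a joint statsmodels fit:

import numpy as np
import statsmodels.api as sm
from sklearn.feature_selection import f_regression

# hypothetical data: two nearly collinear features, but only x0 drives y
rng = np.random.RandomState(0)
x0 = rng.normal(size=200)
x1 = x0 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x0, x1])
y = 2 * x0 + rng.normal(size=200)

# univariate tests: each feature against y on its own -- both look significant
print(f_regression(X, y)[1])

# joint model: each coefficient is tested conditional on the other feature --
# x1 typically loses significance once x0 is in the model
print(sm.OLS(y, sm.add_constant(X)).fit().pvalues)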

Your question is how to calculate p-values using sklearn, without doing an extra pip install of statsmodels:

from sklearn.feature_selection import f_regression

# f_regression runs a univariate linear regression of y on each feature
# and returns a pair of arrays: (F_statistics, p_values)
freg = f_regression(x, y)
p = freg[1]  # the p-values, one per feature
print(p.round(3))
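
As a small follow-up, assuming `x` is a pandas DataFrame, the p-values can be paired with the column names for readability:

import pandas as pd
from sklearn.feature_selection import f_regression

# label each p-value with its feature (column) name
p = f_regression(x, y)[1]
print(pd.Series(p, index=x.columns).round(3))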
Karan Bhandari
  • This seems to be a good answer too, but explain a little bit what is happening here with some documentation, and it will be easier for anyone to understand. – Javier Huerta Feb 01 '22 at 18:20