I am trying to implement a logistic regression using statsmodels (I need the summary) and I get this error:

LinAlgError: Singular matrix

My df is numeric and its features are correlated; I have already deleted the non-numeric and constant features. Because of the correlated features, I tried both a regular (unpenalized) regression and one with an L1 penalty (L2 isn't available).

I tried to check the matrix rank and got this print:

print(len(df.columns)) -> 156

print(np.linalg.matrix_rank(df.values)) -> 151

How do I know which features are a problem and why?

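My code:

import statsmodels.api as sm

logit = sm.Logit(y, X)

# note: alpha is the weight on the L1 penalty, so alpha=0 applies no penalty
result = logit.fit_regularized(trim_mode='auto', alpha=0, maxiter=150)

print(result.summary())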

Update:

After removing highly correlated features I get:

  len(df.columns) == np.linalg.matrix_rank(df.values)

but I still get the same error (even if I set a low correlation threshold).

I tried to change the solver as well.

anna
  • Try df.corr() - this returns a matrix of correlations between the numeric columns in your dataframe. From that you can check if any two of your features are exactly correlated. – Johannes Wachs Nov 05 '17 at 13:22
  • @Johannes Wachs , I deleted the correlated features and it works. tnx. – anna Nov 05 '17 at 13:42
  • see https://stackoverflow.com/a/13313828/333700 for how to use QR to find all collinear or linearly independent columns – Josef Nov 05 '17 at 14:02

1 Answer

As suggested in the comments, if two features are perfectly correlated the model won't run, because the matrix that has to be inverted during fitting becomes singular. The easiest way to check this, if you have a pandas dataframe with a small number of columns, is to call the .corr() method on it (here df.corr()) and check whether any pair of features has a correlation of exactly 1 or -1.
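A minimal sketch of that check, in case it helps. The helper name perfectly_correlated_pairs, the tolerance tol, and the toy data are made up for illustration and are not part of pandas or statsmodels:

```python
import numpy as np
import pandas as pd

def perfectly_correlated_pairs(df, tol=1e-8):
    """Return (col_a, col_b, corr) for pairs whose |correlation| is within tol of 1."""
    corr = df.corr().to_numpy()
    cols = df.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # NaN correlations (e.g. from constant columns) fail this test and are skipped
            if abs(abs(corr[i, j]) - 1.0) < tol:
                pairs.append((cols[i], cols[j], corr[i, j]))
    return pairs

# Toy data (hypothetical): "b" is an exact linear function of "a"
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=50), "c": rng.normal(size=50)})
toy["b"] = 2.0 * toy["a"] + 1.0

pairs = perfectly_correlated_pairs(toy)
print(pairs)
```

On your own data you would call perfectly_correlated_pairs(df) and decide which column of each reported pair to drop.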

You should really think about why some features are perfectly correlated though.
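Note that a pairwise check can miss cases where one column is a linear combination of several others: no single pair is perfectly correlated, yet the matrix still loses rank. Following the QR idea linked in the comments, here is a rough sketch using scipy.linalg.qr with column pivoting. The helper name dependent_columns, the threshold rtol, and the toy data are assumptions for illustration, and the result can be sensitive to how the columns are scaled:

```python
import numpy as np
import pandas as pd
from scipy.linalg import qr

def dependent_columns(df, rtol=1e-10):
    """Estimate the rank via pivoted QR and list columns that look redundant."""
    X = df.to_numpy(dtype=float)
    # Pivoting orders columns so that |diag(R)| is non-increasing
    _, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int((diag > rtol * diag[0]).sum())
    # Columns pivoted past the numerical rank are (near) combinations of the rest
    return rank, list(df.columns[piv[rank:]])

# Toy data (hypothetical): "c" = "a" + "b", but no pair is perfectly correlated
rng = np.random.default_rng(1)
toy_qr = pd.DataFrame({"a": rng.normal(size=40), "b": rng.normal(size=40)})
toy_qr["c"] = toy_qr["a"] + toy_qr["b"]

rank, redundant = dependent_columns(toy_qr)
print(rank, redundant)

# Dropping the flagged columns should leave a full-column-rank matrix
reduced = toy_qr.drop(columns=redundant)
print(np.linalg.matrix_rank(reduced.to_numpy()) == reduced.shape[1])
```

Which member of a dependent group gets flagged depends on the pivoting order, so treat the output as a starting point for deciding what to drop rather than as the one correct answer.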

Johannes Wachs
  • Is there any other possible reason? I just removed all of the features with a correlation of 0.4 and up and I got the same error... – anna Nov 16 '17 at 12:20