
When I have an underconstrained system with confounded (aliased) columns, lm in R drops the redundant second- and third-order interactions (which seems like the correct behaviour to me), but statsmodels (in Python) splits the value among all of the confounded columns.

Imagine I have the following data:

  a    b    c    y  
--------------------
 -1   -1    1    4
  1   -1   -1   30
 -1    1   -1    6
  1    1    1    4 

Using lm('y ~ a * b * c') in R gives me the following coefficients:

  • intercept, 11
  • a, 6
  • b, -6
  • c, -7

with the aliased interaction terms reported as NA.

I can get the same coefficients with 'y ~ a + b + c' in statsmodels, but the product version splits each coefficient evenly with its aliased interaction (a with b:c, b with a:c, c with a:b, and the intercept with a:b:c).

Using '+' instead of '*' does not work for more complicated designs, where R also finds some significant two-factor interactions but drops everything else; the purely additive model would lose those interactions entirely.

How can I make statsmodels act like R in this case? Or how can I set it up to get a decent result?

A MWE:

import pandas as pd
import pyDOE
import statsmodels.formula.api as smf

water_frac = pd.DataFrame(pyDOE.fracfact("a b ab"), columns=["A", "B", "C"])
water_frac["y"] = [4, 30, 6, 4]

When you do:

smf.ols(formula="y ~ A+B+C", data=water_frac).fit().params

You get:

Intercept    11.0
A             6.0
B            -6.0
C            -7.0
dtype: float64

While this:

smf.ols(formula="y ~ A*B*C", data=water_frac).fit().params

gives:

Intercept    5.5
A            3.0
B           -3.0
A:B         -3.5
C           -3.5
A:C         -3.0
B:C          3.0
A:B:C        5.5
dtype: float64
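For this particular design the behaviour is easy to reproduce by hand. The sketch below uses plain numpy rather than statsmodels (column names taken from the question); it shows that the minimum-norm least-squares solution splits each effect evenly across its aliased pair, and that simply dropping the aliased columns recovers R's coefficients:

```python
import numpy as np

# Fractional-factorial design from the question: C = A*B, so every
# interaction column is identical to one of the main-effect columns.
A = np.array([-1, 1, -1, 1])
B = np.array([-1, -1, 1, 1])
C = A * B
y = np.array([4.0, 30.0, 6.0, 4.0])

# Full model matrix for y ~ A*B*C:
# intercept, A, B, C, A:B, A:C, B:C, A:B:C
X = np.column_stack([np.ones(4), A, B, C, A*B, A*C, B*C, A*B*C])

# lstsq (like statsmodels' default pinv) returns the minimum-norm
# solution, which splits each effect evenly between its aliased pair.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [5.5, 3, -3, -3.5, -3.5, -3, 3, 5.5]

# Dropping the aliased columns (what R effectively does) recovers
# intercept=11, A=6, B=-6, C=-7.
beta_r, *_ = np.linalg.lstsq(X[:, :4], y, rcond=None)
print(beta_r)  # [11, 6, -6, -7]
```

Note that each aliased pair in the full fit sums to the corresponding reduced coefficient (e.g. A = 3 plus B:C = 3 gives the 6 that R reports for a).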
  • statsmodels OLS is using the generalized inverse by default instead of automatically dropping columns. See for example http://stackoverflow.com/questions/37472963/statmodels-in-python-package-how-exactly-duplicated-features-are-handled/37499160#37499160 – Josef Jul 09 '16 at 08:13
  • Thank you. I've asked for clarification in a comment there. – Jean Nassar Jul 11 '16 at 17:46
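Building on that comment: to mimic R's column-dropping automatically, one option (a sketch, not a built-in statsmodels feature) is a rank-revealing pivoted QR on the model matrix, keeping only the pivoted columns up to the numerical rank before fitting:

```python
import numpy as np
from scipy.linalg import qr

# Model matrix for the full y ~ A*B*C model from the question.
A = np.array([-1, 1, -1, 1])
B = np.array([-1, -1, 1, 1])
C = A * B  # aliased by construction
y = np.array([4.0, 30.0, 6.0, 4.0])
X = np.column_stack([np.ones(4), A, B, C, A*B, A*C, B*C, A*B*C])

# Pivoted QR reveals the numerical rank. R's lm does something similar
# internally, except it prefers earlier columns, whereas pivoting here
# selects columns by size, so the surviving set may differ in a tie.
Q, R, piv = qr(X, pivoting=True)
rank = np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0, 0]))
keep = np.sort(piv[:rank])  # indices of the columns to keep

# The reduced model fits the data exactly, with one coefficient per
# non-aliased column instead of the split values.
beta, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
print(keep, beta)
```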

0 Answers