When I fit an underconstrained system with confounded (aliased) columns, lm
in R drops the aliased second- and third-order interactions, reporting them as NA (which seems like the correct behaviour to me), but statsmodels (in Python) splits the value among all the confounded columns.
Imagine I have the following data:
 a  b  c  y
-----------
-1 -1  1  4
 1 -1 -1 30
-1  1 -1  6
 1  1  1  4
Fitting lm(y ~ a * b * c) in R gives me the following coefficients, with the aliased interaction terms reported as NA:
- a, 6
- b, -6
- c, -7
- intercept, 11
I can get the same numbers from statsmodels with 'y ~ A + B + C', but the product version splits each coefficient evenly with its aliased higher-order interaction (a with b:c, b with a:c, c with a:b, and the intercept with a:b:c).
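To make the confounding concrete, here is a quick numpy check (values copied from the table above):

import numpy as np

a = np.array([-1, 1, -1, 1])
b = np.array([-1, -1, 1, 1])
c = np.array([1, -1, -1, 1])

print(np.array_equal(b * c, a))  # True: a is aliased with b:c
print(np.array_equal(a * c, b))  # True: b is aliased with a:c
print(np.array_equal(a * b, c))  # True: c is aliased with a:b
print((a * b * c == 1).all())    # True: a:b:c is the intercept column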
Using + instead of * does not work for more complicated designs, where R finds some significant two-factor interactions but gives nothing for the rest.
How can I make statsmodels act like R in this case? Or how can I set it up to get a decent result? (The only workaround I have so far is the manual sketch after the outputs below.)
A MWE:
import pandas as pd
import pyDOE
import statsmodels.formula.api as smf

# 2**(3-1) fractional factorial: the generator "ab" makes the third
# column the product of the first two, so C = A*B by construction.
water_frac = pd.DataFrame(pyDOE.fracfact("a b ab"), columns=["A", "B", "C"])
water_frac["y"] = [4, 30, 6, 4]
When you do:
smf.ols(formula="y ~ A+B+C", data=water_frac).fit().params
You get:
Intercept 11.0
A 6.0
B -6.0
C -7.0
dtype: float64
While this:
smf.ols(formula="y ~ A*B*C", data=water_frac).fit().params
gives:
Intercept 5.5
A 3.0
B -3.0
A:B -3.5
C -3.5
A:C -3.0
B:C 3.0
A:B:C 5.5
dtype: float64
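Note that each aliased pair above sums to the matching 'y ~ A + B + C' coefficient (5.5 + 5.5 = 11, 3.0 + 3.0 = 6, -3.5 - 3.5 = -7), so statsmodels really does split the value evenly. The only workaround I have so far is the manual sketch below: imitate the pivoted QR that R's lm uses internally, drop the aliased columns, and fit the reduced system. It works, but it is hand-rolled rather than a statsmodels feature, which is why I am asking for a better way.

import numpy as np
from patsy import dmatrix
from scipy.linalg import qr

X = dmatrix("A * B * C", water_frac)
names = np.array(X.design_info.column_names)
Xa = np.asarray(X)

# Pivoted QR, similar to what R's lm does internally: the first
# `rank` pivots index a maximal linearly independent column set.
_, _, piv = qr(Xa, pivoting=True)
rank = np.linalg.matrix_rank(Xa)
keep = sorted(piv[:rank])

print(names[keep])  # one representative per alias group (which one
                    # survives depends on the pivoting order)

# Least-squares fit on the reduced, full-rank design.
beta, *_ = np.linalg.lstsq(Xa[:, keep], water_frac["y"], rcond=None)
print(dict(zip(names[keep], beta)))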