How do I get the columns that a statsmodels / patsy formula depends on?

Question

Suppose I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})

Now I fit a statsmodels model using a formula (which uses patsy under the hood):

import statsmodels.formula.api as smf
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

What I want is a list of the columns of df that fit depends on, so that I can use fit.predict() on another dataset. If I try list(fit.params.index), for example, I get:

['Intercept', 'x1:x2']

I've tried recreating the patsy design matrix, and using design_info, but I still only ever get x1:x2. What I want is:

['x1', 'x2']

Or even:

['Intercept', 'x1', 'x2']

How can I get this from just the fit object?

Why not just split `'x1:x2'` on `':'`, then, if you're just interacting `x1` and `x2`? Something like `fit.model.formula.split(':')` and then filter the rest out appropriately. Hell, a regex split would be even better, handling `+`, `:`, etc. — blacksite, Apr 12 '17 at 19:39
@bwk have you made any progress with this issue? Have a look at my answer, it should fit your needs. — Jan Trienes, Apr 13 '17 at 16:15

Jan Trienes · Accepted Answer · 2017-04-12T20:38:11.350

4

Simply test if the column names appear in the string representation of the formula:

ols = smf.ols(formula='y ~ x1:x2', data=df)
fit = ols.fit()

print([c for c in df.columns if c in ols.formula])
['x1', 'x2', 'y']
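
The substring test can over-match when a column name happens to occur inside another name in the formula. A minimal sketch of a stricter variant, assuming it is enough to compare whole identifiers, is shown here. The helper `columns_in_formula` is a hypothetical name, not part of statsmodels or patsy, and it will still miss columns referenced through quoted names such as `Q("my column")`.

```python
import re

def columns_in_formula(formula, columns):
    """Return the entries of `columns` that occur as whole identifiers in `formula`."""
    # Collect every identifier-like token, e.g. 'y', 'x1', 'C', 'Helmert'.
    tokens = set(re.findall(r"[A-Za-z_]\w*", formula))
    # Keep the caller's column order; compare whole tokens, not substrings.
    return [c for c in columns if c in tokens]

print(columns_in_formula("y ~ x1:x2", ["x1", "x2", "x3", "y"]))
# ['x1', 'x2', 'y']
print(columns_in_formula("y ~ C(other_col, Helmert)", ["rt", "other_col", "y"]))
# ['other_col', 'y']
```

A column that shares its name with a function used in the formula (for example a column called `C`) would still be reported, so this remains a heuristic.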

Another approach is to reconstruct the patsy model description from the formula. It is more verbose but more reliable, and it does not depend on the original data frame:

import patsy

md = patsy.ModelDesc.from_formula(ols.formula)
termlist = md.rhs_termlist + md.lhs_termlist

factors = []
for term in termlist:
    for factor in term.factors:
        factors.append(factor.name())

print(factors)
['x1', 'x2', 'y']
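
Factor names are patsy expressions, so a transformed term such as `C(x1)` or `np.log(x2)` is reported in that form rather than as a bare column. One possible refinement, sketched here under the assumption that every factor name is a valid Python expression, is to parse each name with the standard `ast` module and keep the variable names that are also columns. The helper name `formula_columns` and the example formula are illustrative only.

```python
import ast
import patsy

def formula_columns(formula, columns):
    """Return the entries of `columns` referenced by any factor in `formula`."""
    md = patsy.ModelDesc.from_formula(formula)
    names = set()
    for term in md.lhs_termlist + md.rhs_termlist:
        for factor in term.factors:
            # factor.name() is the factor's source code, e.g. 'C(x1)'.
            tree = ast.parse(factor.name(), mode="eval")
            # Collect bare variable names; attribute names like 'log' are skipped.
            names.update(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))
    return [c for c in columns if c in names]

print(formula_columns("y ~ C(x1):np.log(x2)", ["x1", "x2", "x3", "y"]))
# ['x1', 'x2', 'y']
```

Building the model description only parses the formula and does not evaluate it, so `np` does not need to be imported here. Columns accessed through a string, such as `Q("my column")`, are not picked up by this sketch.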

edited Apr 12 '17 at 20:38

answered Apr 12 '17 at 20:10

Thanks, this is what I was looking for! – bwk Apr 15 '17 at 18:55
This doesn't work if you happen to have a column name that is a substring of the formula. For example, I have a column called `rt`, which gets captured by the formula `C(other_col, Helmert)` – sammosummo Oct 27 '17 at 20:35
@sammosummo Indeed, the first approach I mention in the answer will match any substrings. However, the second approach does not perform this string matching. – Jan Trienes Oct 28 '17 at 06:50
The second approach doesn't answer the OP's question, since it finds patsy terms, not the columns from the original dataframe. These aren't the same when using categorically coded factors. – sammosummo Oct 28 '17 at 11:51

score 1 · Answer 2 · answered Apr 12 '17 at 19:58

predict accepts a data frame with the same structure as the original data, or a dictionary, and a call to patsy converts it in a compatible way. To replicate this, you can check the code in statsmodels.base.model.Results.predict, the core of which is:

exog = dmatrix(self.model.data.design_info.builder,
               exog, return_type="dataframe")

The formula information itself is stored in the description of the terms in design_info. The variable names themselves are used in summary() and as the index of the returned pandas Series, for example in results.params.

Getting the formula is not a problem. The variable names in `summary()` are not the original variables in the DataFrame, but the _transformed_ variable names; i.e. there's "x1:x2" instead of "x1" and "x2" separately. — bwk, Apr 12 '17 at 20:12

score 0 · Answer 3 · answered Jun 02 '20 at 19:36

ols.exog_names and ols.endog_names should do it

Ferus
