Suppose I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})

Now I fit a statsmodels model using a formula (which uses patsy under the hood):

import statsmodels.formula.api as smf
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

What I want is a list of the columns of df that fit depends on, so that I can use fit.predict() on another dataset. If I try list(fit.params.index), for example, I get:

['Intercept', 'x1:x2']

I've tried recreating the patsy design matrix, and using design_info, but I still only ever get x1:x2. What I want is:

['x1', 'x2']

Or even:

['Intercept', 'x1', 'x2']

How can I get this from just the fit object?

bwk
  • Why not just split `'x1:x2'` on `':'`, then, if you're just interacting `x1` and `x2`? Something like `fit.model.formula.split(':')` and then filter the rest out appropriately. Hell, a regex split would be even better, handling `+`, `:`, etc. – blacksite Apr 12 '17 at 19:39
  • @bwk have you made any progress with this issue? Have a look at my answer, it should fit your needs. – Jan Trienes Apr 13 '17 at 16:15

3 Answers

4

Simply test if the column names appear in the string representation of the formula:

ols = smf.ols(formula='y ~ x1:x2', data=df)
fit = ols.fit()

print([c for c in df.columns if c in ols.formula])
['x1', 'x2', 'y']
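
Note that `c in ols.formula` is a substring test, so a short column name can match inside a longer token. A minimal sketch of a stricter variant follows, assuming all column names are plain Python identifiers; the helper name `columns_in_formula` is hypothetical and not part of statsmodels or patsy. It compares column names against whole identifier tokens rather than raw substrings:

```python
import re


def columns_in_formula(formula, columns):
    """Return the entries of `columns` that occur as whole identifier
    tokens in `formula`, in the order they appear in `columns`.

    Hypothetical helper: assumes column names are valid Python identifiers.
    """
    # Split the formula into identifier-like tokens, e.g.
    # 'y ~ C(other_col, Helmert)' -> {'y', 'C', 'other_col', 'Helmert'}
    tokens = set(re.findall(r"[A-Za-z_]\w*", formula))
    return [c for c in columns if c in tokens]


print(columns_in_formula('y ~ x1:x2', ['x1', 'x2', 'x3', 'y']))
```

This still matches function names used in the formula (a column called `log` would match `np.log(x1)`), so it narrows the problem rather than eliminating it.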

Another approach reconstructs the patsy model description. It is more verbose but more reliable, and it does not depend on the original data frame:

import patsy

md = patsy.ModelDesc.from_formula(ols.formula)
termlist = md.rhs_termlist + md.lhs_termlist

factors = []
for term in termlist:
    for factor in term.factors:
        factors.append(factor.name())

print(factors)
['x1', 'x2', 'y']
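
The factor names collected this way are code strings, so a factor such as `C(other_col, Helmert)` is reported whole rather than as `other_col`. One possible way to get back to data frame columns is to parse each factor string with Python's `ast` module and keep the bare names that are also column names. This is a sketch under assumptions: `columns_from_factors` is a hypothetical helper, every factor string is assumed to be a valid Python expression, and columns quoted with patsy's `Q("...")` are not handled.

```python
import ast


def columns_from_factors(factor_codes, columns):
    """Return the entries of `columns` referenced by name in any of the
    factor code strings, in the order they appear in `columns`.

    Hypothetical helper: assumes each factor string parses as a Python
    expression, e.g. 'x1', 'C(other_col, Helmert)' or 'np.log(x1)'.
    """
    used = set()
    for code in factor_codes:
        tree = ast.parse(code, mode="eval")
        for node in ast.walk(tree):
            # Only bare names count; attribute names such as `log`
            # in `np.log` are ast.Attribute nodes and are skipped.
            if isinstance(node, ast.Name):
                used.add(node.id)
    return [c for c in columns if c in used]


print(columns_from_factors(['x1', 'x2', 'y'], ['x1', 'x2', 'x3', 'y']))
```

With the `factors` list built above, `columns_from_factors(factors, df.columns)` would be the corresponding call; unlike the pure patsy version, this step does need the column names of the data frame.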
Jan Trienes
  • Thanks, this is what I was looking for! – bwk Apr 15 '17 at 18:55
  • 1
    This doesn't work if you happen to have a column name that is a substring of the formula. For example, I have a column called `rt`, which gets captured by the formula `C(other_col, Helmert)` – sammosummo Oct 27 '17 at 20:35
  • @sammosummo Indeed, the first approach I mention in the answer will match any substrings. However, the second approach does not perform this string matching. – Jan Trienes Oct 28 '17 at 06:50
  • 1
    The second approach doesn't answer the OP's question, since it finds patsy terms, not the columns from the original dataframe. These aren't the same when using categorically coded factors. – sammosummo Oct 28 '17 at 11:51
1

predict accepts a data frame or dictionary with the same structure as the original data, and a call to patsy converts it into a compatible design matrix. To replicate this, see the code in statsmodels.base.model.Results.predict, the core of which is

exog = dmatrix(self.model.data.design_info.builder,
               exog, return_type="dataframe")

The formula information itself is stored in the description of the terms in design_info. The variable names themselves are used in summary() and as the index of the returned pandas Series, for example in results.params.

Josef
  • 1
    Getting the formula is not a problem. The variable names in `summary()` are not the original variables in the DataFrame, but the _transformed_ variable names; i.e. there's "x1:x2" instead of "x1" and "x2" separately. – bwk Apr 12 '17 at 20:12
0

ols.exog_names and ols.endog_names should do it

Ferus