Using ols function with parameters that contain numbers/spaces

Question

I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj

due to the nature of the names of my predictors. The predictors have numbers, spaces, etc. in them, which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:

Q("weight.in.kg")

So I need to build my formula from a list, and the difficulty arises in wrapping every item in the list with patsy.builtins.Q:

formula = "{} ~ {} + 1".format(response, ' + '.join([candidate]))

with [candidate] being my list of predictors.

My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.

score 4 · Answer 1 · edited Jan 09 '19 at 22:14

Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
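As a small illustration of the structure described above, the sketch below parses a formula with an interaction and inspects the factors of each term. The formula and the variable names x1 and x2 are made up for the example; only ModelDesc.from_formula and Term come from the answer.

```python
from patsy import ModelDesc, Term

# Parse a formula that contains one main effect and one interaction.
desc = ModelDesc.from_formula("y ~ x1 + x1:x2")

# Each right-hand-side term carries a collection of factors:
# none for the intercept, one for x1, two for the x1:x2 interaction.
factor_counts = [len(term.factors) for term in desc.rhs_termlist]
factor_names = [[f.name() for f in term.factors] for term in desc.rhs_termlist]

# The intercept is the term with no factors at all.
has_intercept = Term([]) in desc.rhs_termlist

for count, names in zip(factor_counts, factor_names):
    print(count, names)
print("intercept present:", has_intercept)
```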

So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
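Here is a minimal end-to-end sketch of the LookupFactor approach, using patsy's dmatrices directly so it does not depend on statsmodels. The data values and the names "body mass" and "2nd reading" are invented for illustration; the point is only that names with dots, spaces and leading digits are looked up as-is.

```python
import numpy as np
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data whose column names are not valid Python identifiers.
data = {
    "body mass": np.array([1.0, 2.0, 3.0, 4.0]),
    "weight.in.kg": np.array([2.0, 1.0, 4.0, 3.0]),
    "2nd reading": np.array([0.5, 0.1, 0.9, 0.3]),
}
response = "body mass"
candidates = ["weight.in.kg", "2nd reading"]

# Left-hand side: the response. Right-hand side: intercept plus one term per candidate.
response_terms = [Term([LookupFactor(response)])]
model_terms = [Term([])] + [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

# dmatrices accepts a ModelDesc anywhere it accepts a formula string.
y, X = dmatrices(model_desc, data)
column_names = X.design_info.column_names
print(column_names)
```

The same model_desc can then be handed to ols in place of the formula string, as shown above.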

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
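To make the quote-character problem concrete, the sketch below compares the naive Q('...') wrapping with a variant that uses repr() to produce a properly escaped string literal. The repr() variant is not part of the answer above; it is an assumption about how one might harden the string approach, and it only addresses quoting. It uses Python's own parser as a stand-in check and does not call patsy. The helper names and the example name "owner's age" are hypothetical.

```python
import ast


def naive_quote(name):
    # The approach from the answer: breaks if name contains a single quote.
    return "Q('{}')".format(name)


def repr_quote(name):
    # Assumption: let Python build the string literal, escaping as needed.
    return "Q({!r})".format(name)


def is_valid_expression(code):
    # True if code parses as a single Python expression.
    try:
        ast.parse(code, mode="eval")
        return True
    except SyntaxError:
        return False


tricky = "owner's age"
naive_ok = is_valid_expression(naive_quote(tricky))
repr_ok = is_valid_expression(repr_quote(tricky))
print(naive_quote(tricky), naive_ok)
print(repr_quote(tricky), repr_ok)
```

Even with safer quoting, the formula string is still parsed as code, so building the terms with LookupFactor remains the more robust option when the names come from an untrusted source.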

Comments:

Your fancier Python string processing is not quite right: it has an extra pair of brackets. You want join(quoted) instead. — David Bridgeland, Jan 17 '18 at 21:29
You need a second close-paren in your fancier Python string processing. — Keller Scholl, Jan 09 '19 at 19:55
