Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):
In [7]: from patsy import ModelDesc
In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]:
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
rhs_termlist=[Term([]),
Term([EvalFactor('x1')]),
Term([EvalFactor('x2')]),
Term([EvalFactor('x3')])])
This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
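To make the "multiple factors" point concrete, here is a small sketch that parses a formula containing an interaction and counts the factors in each right-hand-side term. The formula and the variable names x1 and x2 are made up for illustration; nothing here needs actual data, since only the parser is involved.

```python
from patsy import ModelDesc

# Hypothetical formula with one main effect and one interaction.
desc = ModelDesc.from_formula("y ~ x1 + x1:x2")

# Each Term exposes its factors; the intercept is the term with none.
factor_counts = [len(term.factors) for term in desc.rhs_termlist]
for term, count in zip(desc.rhs_termlist, factor_counts):
    print(term.name(), count)
```

The intercept term should report zero factors, the main effect one, and the interaction two.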
So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:
from patsy import ModelDesc, Term, LookupFactor
response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)
and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:
ols(model_desc, data).fit().rsquared_adj
There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
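As a quick check that LookupFactor really does cope with awkward names, here is a minimal sketch that builds design matrices straight from a ModelDesc using patsy's dmatrices. The data values and the column names (including weight.in.kg) are invented for the example; your own response and candidates would take their place.

```python
import numpy as np
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data; "weight.in.kg" is not a valid Python identifier.
data = {
    "y": np.array([1.0, 2.0, 3.0, 4.0]),
    "weight.in.kg": np.array([60.0, 72.5, 81.0, 55.0]),
    "height": np.array([1.6, 1.8, 1.9, 1.5]),
}
response = "y"
candidates = ["weight.in.kg", "height"]

# Intercept plus one single-factor term per candidate, no string parsing.
desc = ModelDesc(
    [Term([LookupFactor(response)])],
    [Term([])] + [Term([LookupFactor(c)]) for c in candidates],
)

y, X = dmatrices(desc, data)
print(X.design_info.column_names)
```

The column names come out as the raw variable names, with no Q(...) wrapper needed anywhere.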
Alternatively, you could do this with some fancier Python string processing, like:
quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))
But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
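To see the fragility without running anything dangerous, the sketch below only parses the generated expressions with Python's ast module instead of handing them to patsy. The helper names naive_quote and called_names are made up for this illustration, and the "hostile" column name is a contrived example of what an untrusted CSV header could contain.

```python
import ast

def naive_quote(name):
    # Same quoting as the string-formatting approach above.
    return "Q('{}')".format(name)

def called_names(expr):
    """Return the sorted, de-duplicated names of functions called in expr."""
    tree = ast.parse(expr, mode="eval")
    return sorted({
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    })

benign = naive_quote("weight.in.kg")
hostile = naive_quote("x') + __import__('os').getcwd() + Q('x")

print(called_names(benign))   # only the Q(...) lookup
print(called_names(hostile))  # the column name smuggled in another call

# A lone quote character does not even produce valid Python.
try:
    called_names(naive_quote("father's age"))
    broke = False
except SyntaxError:
    broke = True
print(broke)
```

The hostile name turns one harmless lookup into an expression that also calls __import__, which is exactly the kind of thing you don't want evaluated on your behalf.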