Context: Python 3.4.3

I'm not very good with regular expressions, and I can't seem to figure out a robust solution to this using re.

Suppose we have a long patsy formula and somewhere in the middle is an expression like:

... + xvar + np.log(xvar)+xvar**2 + xvar2+ z...

Patsy formulas are just strings that follow well-behaved rules, so I'm wondering if anyone has written / can easily write a robust method for dropping specific terms from a given formula? So, for example:

>>> remove_term(long_formula, 'xvar')
... + np.log(xvar)+xvar**2 + xvar2+ z...

and

>>> remove_term(long_formula, 'xvar2')
... + xvar + np.log(xvar)+xvar**2 + z...

etc. This would need to also be robust to having a variable at the beginning / end of the right-hand side formula spec.

My limited regex-fu only produces things like:

re.sub(r'[^(]\s*xvar\s*', ' FOUND IT ', 'y ~ xvar + np.log(xvar)')

Maybe a semi-complicated if/else re.sub situation?

chriswhite
  • Could you provide more details? Are you only looking to remove `xvar` when it's part of a series of additions/subtractions? It seems like you don't want to remove things inside parenthesis... – Laurel May 04 '16 at 17:42
  • Try [`(?m)(?:\s*(?:^|[-+/:]|\*\*?)\s*x\s*|\s*x\s*($|[-+/:]|\*\*?)\s*)+`](https://regex101.com/r/rL6dN1/3) and replace with a space. See this [Python demo](http://ideone.com/ZEIwjy) – Wiktor Stribiżew May 04 '16 at 21:14

1 Answer

There is no general way to do what you want with a regular expression, because Patsy's formula language is not a regular language. (Just like how HTML is not a regular language.)
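To make the failure mode concrete, here is a small sketch (the formula string is an illustrative example, not a real model) showing that even a word-boundary pattern can't tell a top-level term from the same name nested inside a function call:

```python
import re

# Illustrative formula; any patsy-style string would behave the same way.
formula = "y ~ xvar + np.log(xvar) + xvar2"

# \b stops the pattern from matching inside "xvar2", but it still matches
# the "xvar" nested inside np.log(...), which belongs to a different term.
marked = re.sub(r"\bxvar\b", "<HIT>", formula)
print(marked)  # y ~ <HIT> + np.log(<HIT>) + xvar2
```

Telling those two hits apart means tracking how deeply the parentheses are nested, which is exactly what a plain regex can't do in general.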

But there's no need to go messing about with strings anyway: as the patsy documentation describes, patsy provides a nice object-oriented representation for formulas as part of its public API. Internally, you're using this every time you call dmatrix: formula strings get parsed into this representation, and this representation is then used for everything downstream. But you can also work with it directly, like:

In [3]: m = patsy.ModelDesc.from_formula("xvar + np.log(xvar)+xvar**2 + xvar2")

In [4]: m
Out[4]: 
ModelDesc(lhs_termlist=[],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('xvar')]),
                        Term([EvalFactor('np.log(xvar)')]),
                        Term([EvalFactor('xvar2')])])

In [5]: m.rhs_termlist.remove(patsy.Term([patsy.EvalFactor('xvar')]))

In [6]: m
Out[6]: 
ModelDesc(lhs_termlist=[],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('np.log(xvar)')]),
                        Term([EvalFactor('xvar2')])])

You can then pass m to any patsy function that expects a formula, e.g. patsy.dmatrix(m, dataframe).
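If you want something shaped like the remove_term from the question, the steps above could be wrapped up roughly as follows. This is only a sketch: remove_term is the hypothetical helper name from the question, not part of patsy, and it assumes the term to drop is a single factor written the way it appears in the formula.

```python
import patsy


def remove_term(formula, factor_code):
    """Hypothetical helper (not a patsy API): parse `formula` and return a
    ModelDesc without the single-factor term whose code is `factor_code`."""
    desc = patsy.ModelDesc.from_formula(formula)
    # Terms compare equal when they contain the same factors, so build the
    # term we want to drop and filter it out of the right-hand side.
    target = patsy.Term([patsy.EvalFactor(factor_code)])
    rhs = [term for term in desc.rhs_termlist if not (term == target)]
    # If the term isn't present, the formula is returned unchanged.
    return patsy.ModelDesc(desc.lhs_termlist, rhs)


desc = remove_term("y ~ xvar + np.log(xvar) + xvar2", "xvar")
print(desc.describe())  # pseudo-formula string for the remaining terms
```

The returned ModelDesc can be handed to patsy.dmatrix in the same way as m above. One thing to watch for: patsy treats ** as a formula operator, so xvar**2 collapses into the plain xvar term (that's why it doesn't show up as a separate term in Out[4]). If you want the squared variable as its own term, write it as I(xvar**2).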

Nathaniel J. Smith