2

I want to match all combinations of <>=/*+- except for = and =>. How can I do this?

 r = re.compile(r'[<>=/*+-]+')

This matches one or more characters in the set but I don't know how to prevent it from matching the = or => patterns. I'd guess it has something to do with negative lookahead or lookbehind but it's hard for me to wrap my head around that.


clarification: I literally want to match all combinations of the characters in <>=/*+- except for = and =>. In other words, I want to find maximal-length consecutive substrings consisting only of these characters -- and if the substring equals = or =>, it should not be considered a match.

I apologize for not clarifying earlier, but it seemed like a simple enough problem statement not to need the extra clarification.

Example cases:

  • pow pow -> bah bah contains the match ->
  • a +++->* b // c contains the matches +++->* and //
  • => 3 <= 4 = 5 == 6 contains the matches <= and == (remember, = and => are not matches)
  • a <=> b <@> c contains the matches <=> and < and >
  • ---= =--- contains the matches ---= and =---
Jason S
  • 184,598
  • 164
  • 608
  • 970
  • 3
    To clarify, `=<` should match but `=` and `=>` should not? Also, this looks as if you're matching operators like `+` and `+=`, but as written strings like `/*>=` will match. Is that what you want? – John Kugelman Dec 23 '13 at 19:17
  • So if there's a `=>`, do you want to find the `>`, or nothing? – abarnert Dec 23 '13 at 19:21
  • Also, following up on what @JohnKugelman says: regexps are a bad way to parse expression syntax, but there are very good ways to do it that are simple and efficient; you can build an [OP parser](http://en.wikipedia.org/wiki/Operator-precedence_parser) or other simple shift-reduce parser from scratch pretty easily, or use something like PyParsing. – abarnert Dec 23 '13 at 19:24
  • I want arbitrary strings of those characters except for `=` or `=>`. So `==` or `/**+` or `-=>` should match, but not `=` or `=>`. – Jason S Dec 23 '13 at 19:24
  • FWIW I'm working on a lexer using ply (which uses regexps). There's a higher-level question here but I want the answer to this one first so I can form my other question more intelligently. – Jason S Dec 23 '13 at 19:25
  • @JasonS: So `=>` should not match as `>`, but `==>` should match as a single group? – abarnert Dec 23 '13 at 19:26
  • abarnert: "So if there's a =>, do you want to find the >, or nothing?" I want to find nothing. – Jason S Dec 23 '13 at 19:26
  • @JasonS: And the other question? In `a==>b`, you want the `==>`? – abarnert Dec 23 '13 at 19:28
  • 2
    If you insist on going the re route, I think that it might be best to split this into two expressions - one matches the characters you do want, and then one filtering out the characters you don't want. There probably is a single expression that will do it all, but it will probably be quite unreadable. – nfazzio Dec 23 '13 at 19:28
  • 1
    @nfazzio: And it may also be horribly inefficient (possibly even requiring exponential backtracking, while two linear regexps in series is obviously still linear—and they're dead-simple and fast ones, too). – abarnert Dec 23 '13 at 19:30
  • @abarnert: You seem to know way more about regex than me - do you know of any tools that compute the time complexity of pattern matching? – nfazzio Dec 23 '13 at 19:34
  • 1
    @nfazzio: IIRC, someone proved (with Perl regexps, which are not quite the same as Python's, but pretty close) that it is impossible to determine which regexps can be exponential in sub-exponential time. You could easily write a tool that flagged any regexp that _might_ be exponential, but that would give you a ton of false positives. See [Catastrophic Backtracking](http://www.regular-expressions.info/catastrophic.html) for a simple discussion, and [this question](http://stackoverflow.com/questions/8887724/why-can-regular-expressions-have-an-exponential-running-time) for lots of good links. – abarnert Dec 23 '13 at 19:56
  • 2
    @JasonS: Since there are a lot of non-obvious edge cases here, instead of answering them one by one in comments and leaving us to try to guess the consequences, it would really help to provide a list of test cases and expected output to make this unambiguous. The question really should be meaningful to someone who hasn't read dozens of scattered comments, and this one is not. – abarnert Dec 23 '13 at 20:01

2 Answers2

2

edited: Implemented abarnert's suggestions below:

I would split this into two parts:

The first part will return a list of all matches - including the '=>' and '=' that you don't wish to match.

p1 = re.compile(r'[<>=/*+-]+')

The second part will filter these matches out.

all_matches = p1.finditer(your_string)
matches = [match.group() for match in all_matches if match.group() not in ('=', '=>')]
nfazzio
  • 498
  • 1
  • 3
  • 12
  • 1
    Great approach, and nice simple implementation. But two minor quibbles. First, if you've got a compiled regexp, use `p1.findall(s)`, not `re.findall(p1, s)`. Second, use `finditer` instead of `findall`; no reason to build a list just to iterate over it. – abarnert Dec 23 '13 at 20:57
  • @abarnert: thanks for the suggestions - I've implemented them. A question about the `callable-iterator` type: If I iterate through it once in a list comprehension, how can I iterate through it again? If I do the list comprehension twice, the first time it results in what I want, but the second time it results in an empty list. – nfazzio Dec 23 '13 at 21:17
  • You can't iterate through an iterator twice. After you create `matches`, `all_matches` is empty. If you really _need_ to iterate through something twice, then you probably want a list. But in most cases, you don't have any need for that, so don't pay the cost. See [Iterators](http://docs.python.org/2/tutorial/classes.html#iterators) and the following sections in the tutorial. – abarnert Dec 23 '13 at 21:32
0

This might work:

pat = re.compile(r'((?!=|=>)[<>=/*+-]+)')

It uses negative look-around syntax, described in detail here: Regular expression to match a line that doesn't contain a word?

EDIT: The simple look-around above will unfortunately match ">" when fed "=>" so to work around that it can get a little hairy:

pat = re.compile(r'((?!=>|(?!=)>)([<>/*+-]|[<>=/*+-]{2,10}))')

I'm assuming you don't want to match strings longer than 10. This separates the matches into single-character operators (from which we exclude "=") and multi-character operators (where "=" is ok) except for "=>" -- It also excludes an edge case we're not interested in, just the ">" of the rejected "=>"

This is completely unreadable, however, and if it makes it into your code there should be copious comments. Agree with other commenters that a single regex is not suited for this problem.

Community
  • 1
  • 1
Tom McClure
  • 6,699
  • 1
  • 21
  • 21