7

Disclaimer: I'm not a regex expert.

I'm using Python re module to perform regex matching on many htm files. One of the patterns is something like this:

<bla><blabla>87765.*</blabla><bla>

The problem I've encountered is that instead of finding all (say) five occurrences of the pattern, it will find only one. Because it welds all the occurrences into one, using the <bla><blabla>87765 part of the first occurrence and the </blabla><bla> part of the last occurrence in the page.

Is there any way to tell re to find the smallest match?

Dave Berk
  • 1,591
  • 2
  • 15
  • 17

4 Answers4

19

You can use a reluctant qualifier in your pattern (for more details, reference the python documentation on the *?, +?, and ?? operators):

<bla><blabla>87765.*?</blabla><bla>

Or, exclude < from the possible matched characters:

<bla><blabla>87765[^<]*</blabla><bla>

only if there are no children tags between <blabla> and </blabla>.

tzot
  • 92,761
  • 29
  • 141
  • 204
iammichael
  • 9,477
  • 3
  • 32
  • 42
5

The Python re module supports nongreedy matching. You just add a ? to the end of the wildcard pattern, such as .*?. You can learn more at this HOWTO.

Gordon Seidoh Worley
  • 7,839
  • 6
  • 45
  • 82
  • 5
    Non-greedy matching is not a solution to this question. Try the pattern 'first something first second' and the pattern 'first(.*?)second'. Even if it's non-greedy it will still take the first match it finds, which is the bigger, left-most one. The greediness just affects how it treats the pattern at each character. – Bill Prin Nov 22 '12 at 23:31
1
I believe the regex
<bla><blabla>87765.*?</blabla><bla>
can produce catastrophic backtracking.

Instead, use:
<bla><blabla>87765[^<]*</blabla><bla>

Using atomic grouping (I'm not sure Python supports this), 
the above regex becomes 
<bla><blabla>(?>(.*?<))/blabla><bla>

Everything between (?> ... ) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string. Note that I needed to include the "<" in the group to ensure atomicity. Close enough.

  • With only the one quantifier and no alternation, I don't see any potential for catastrophic backtracking. The `.*?` version might be less efficient than `[^<]*` but you'd probably never notice the difference. And, no, Python does not support atomic groups. – Alan Moore Sep 15 '09 at 21:08
0

Um... there is a way to tell re to find the smallest match, and it's precisely by using non-greedy quantifiers.

<bla><blabla>87765.*?</blabla><bla>

I can't imagine why you would want to do it while using greedy quantifiers.

David Z
  • 128,184
  • 27
  • 255
  • 279