1

I'm trying to craft a regex able to match anything up to a specific pattern. The regex then will continue looking for other patterns until the end of the string, but in some cases the pattern will not be present and the match will fail. Right now I'm stuck at:

.*?PATTERN

The problem is that, in cases where the string is not present, this takes too much time due to backtraking. In order to shorten this, I tried mimicking atomic grouping using positive lookahead as explained in this thread (btw, I'm using re module in python-2.7): Do Python regular expressions have an equivalent to Ruby's atomic grouping?

So I wrote:

(?=(?P<aux1>.*?))(?P=aux1)PATTERN

Of course, this is faster than the previous version when STRING is not present but trouble is, it doesn't match STRING anymore as the . matches everyhing to the end of the string and the previous states are discarded after the lookahead.

So the question is, is there a way to do a match like .*?STRING and alse be able to fail faster when the match is not present?

Community
  • 1
  • 1
Ilopez
  • 13
  • 3
  • Where is `STRING` in your regex? I can't quite follow your example. – Tim Pietzcker Mar 28 '14 at 14:12
  • Hi Tim. The pattern I'm looking for is "src=", and I need the .*? before because there might be several variable fields before that I want to ignore: ^(?:(?:\d{4}-\d{2}-\d{2}\s+\d\d:\d\d:\d\d)|(?:\w{3}\s+\d{1,2}\s\d\d:\d\d:\d\d))\s+id=(?P\S+)\s+sn=(?P\S+)\s+time=\"(?P\d{4}-\d{2}-\d{2}\s+\d{2}(?::|\s+)\d{2}(?::|\s+)\d{2}[^"]*)\"\s+fw=(?P\S+)\s+pri=(?P\d)\s+(?:\S+\s+)m=(?P\d+)\s+.*?src=(?P[^:\s]+)(?::(?P[^:\s]*)\S*)?\s+dst=(?P[^:\s]+)(?::(?P[^:\s]*)\S*)?\s+proto=(?P[^/]+).*?Category=\"(?P[^\"]+).*$ – Ilopez Mar 31 '14 at 06:32
  • And this is the kind of line I'm parsing: May 18 12:47:21 id=firewall sn=XXXXXXX time="2012-05-18 19:47:42 UTC" fw=xxx.xxx.xxx.xxx pri=6 c=1024 m=97 n=696201 src=xxx.xxx.xxx.xxx:xxxx:X0:xxxxxxxx dst=xxx.xxx.xxx.xxx:80:X2:xxxxxx.com proto=tcp/http op=GET sent=1274 rcvd=8355 result=0 dstname=www.xxxx.com arg=xxxxxxxxxxxxx appcat="xxxxx" appid=xxx code=31 Category="Web Communications" – Ilopez Mar 31 '14 at 06:33

3 Answers3

1

You could try using split

If the results are of length 1 you got no match. If you get two or more you know that the first one is the first match. If you limit the split to size one you'll short-circuit the later matching:

"HI THERE THEO".split("TH", 1) # ['HI ', 'ERE THEO']

The first element of the results is up to the match.

Paul Rubel
  • 26,632
  • 7
  • 60
  • 80
  • Hi Paul. Unfortunately I'm not able to change the code, which is already using re.search, but this is a nice trick to know. Thanks! – Ilopez Mar 31 '14 at 06:26
0

The Python documentation includes a brief outline of the differences between the re.search() and re.match() functions http://docs.python.org/2/library/re.html#search-vs-match. In particular, the following quote is relevant:

Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.

Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.

In your case, it would be preferable to define your pattern simply as:

pattern = re.compile("PATTERN")

And then call pattern.search(...), which will not backtrack when the pattern is not found.

Community
  • 1
  • 1
Brett Lempereur
  • 815
  • 5
  • 11
  • Hi Brett. I'm already using re.search, it's just that the .*?PATTERN is part of a bigger regexp. I think I should add a comment to the question to clarify this. Thanks! – Ilopez Mar 31 '14 at 06:27
0

One-Regex Solution

^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN

Explanation

You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. Problem is, you can't use lazy quantifiers at the end of an atomic grouping: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.

So the regex engine will match the .*?, because of the laziness it will step outside of the group to check if the next character is a P, and if it's not it won't be able to backtrack inside the group to match that next character inside the .*.

What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.

This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.

Translated with your workaround, this would lead to the above regex (added ^ since you should anchor the regex, either to the start of the string or to another regex).

Robin
  • 9,415
  • 3
  • 34
  • 45
  • Thanks a lot! This did the job perfectly. Unfortunately, python re module doesn't support AG nor possesive quantifiers at this moment and I'm not able to change it for the regex module that does. – Ilopez Mar 31 '14 at 06:48
  • You're welcome, glad I could help. Don't forget to mark the solution that solved your issue as "accepted" to close the question and you're good to go! – Robin Mar 31 '14 at 07:43