Python regex negative lookbehind

Question

We parse logs created by automated scripts. A typical thing that we'd care about is the string '1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)' from the following line:

15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: '1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'

The trouble is, some logs are manually created, with users entering this information themselves. To remind themselves of the format they have added a dialog with the template:

02:24:50.655 - INFO   - gui: Step Dialog: For test results management purposes, specify the build in which the test is executed in the following format, build version: 'specify version here'
02:25:04.905 - INFO   - gui:     Response: OK
02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''

My regex for this currently is .*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'. The '(?!.*<)' was my first attempt to avoid this problem, because some users would write ''. That doesn't catch the above case though. I think the correct thing to do is going to be a negative lookbehind that does not match if 'Step Dialog' is present on the line, but my attempts to write that seem to be failing me, according to regexr (for some reason it's not letting me share the link to my saved form). I thought negative lookbehind would look like this: (?<!Step Dialog) and result in this:

`(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

but that's matching both the first and third line of the above for some reason.

Edit:
The '[Bb]' and ':*\s*' parts are for users who did not enter the information in precisely the right format, for example by using multiple colons and spaces or by capitalizing 'Build'. Suggestions for cleaning this up in general are appreciated; I'm relatively new to regexes.

Chriszuma · Accepted Answer · 2011-10-14T15:01:10.790

2

You are close, but it still matches because the engine can find a position (the very start of the line, for instance) that is not preceded by Step Dialog, and from there .* is free to consume everything up to build version. A lookaround assertion is only checked at the position where it appears in the pattern. Thus, you have to force the check at every character that comes before build version.
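The effect is easy to reproduce. The following is a minimal sketch, assuming the pattern is applied with `re.search` to one line at a time (the question does not say how it is called); the variable names are only illustrative.

```python
import re

# Pattern from the question, with the lookbehind at the very start.
ORIGINAL = re.compile(
    r"(?<!Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"
)

# The template line from the question that should NOT match.
step_dialog_line = (
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'"
)

match = ORIGINAL.search(step_dialog_line)

# The lookbehind is tested at offset 0, where nothing precedes it,
# so it succeeds and .* is free to run past "Step Dialog".
print(match.start())   # 0
print(match.group(1))  # specify version here
```

The match starts at offset 0, so the lookbehind never gets a chance to see the text "Step Dialog" behind it.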

Try this:

`^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Now, it ensures that no position between ^ (the beginning of the line) and [Bb]uild [Vv]ersion is the start of the string Step Dialog.

You'll notice I also changed it from a lookbehind to a negative lookahead, because it's easier to understand what's going on.
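As a rough check, here is a small sketch that runs the pattern above against the three sample lines from the question. The helper name `extract_version` and the use of `re.search` on single lines are assumptions made for illustration, not part of the original scripts.

```python
import re

# Tempered pattern from this answer: every character consumed before
# "build version" is first checked not to begin "Step Dialog".
VERSION_RE = re.compile(
    r"^(?:(?!Step Dialog).)*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"
)


def extract_version(line):
    """Return the captured version text, or None if the line is rejected."""
    match = VERSION_RE.search(line)
    return match.group(1) if match else None


# Sample lines copied from the question.
script_line = (
    "15:28:02.115 - INFO   - TestLib: Successfully retrieved build version: "
    "'1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)'"
)
template_line = (
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'"
)
comments_line = (
    "02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''"
)

print(extract_version(script_line))    # 1.11.11-SNAPSHOT (1.11.11-20110303.024749-7)
print(extract_version(template_line))  # None
print(extract_version(comments_line))  # 1.11.11
```

The `(?:...)` group is non-capturing, so the version is still available as group 1.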

edited Oct 14 '11 at 15:01

answered Oct 14 '11 at 14:54

Chriszuma


Excellent, thanks much! I also learned about (?:...) from this, which is functionality I've wondered about before. – Nathan Oct 14 '11 at 15:16

score 0 · Answer 2 · edited May 23 '17 at 09:59

0

Couple ways you can do this, but you're pretty close.

`.*(?<!Step Dialog.*)[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`
`^(?!.*Step Dialog).*[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'`

Chriszuma's pattern should work, too. Use whichever you like best. If performance is a consideration, you could benchmark the three patterns and see which is faster. My feeling is that it'll be the one starting with `.*(?<!`, but I can't say for sure.

Edit: As ekhumoro points out, the Python regex engine requires fixed-length lookbehinds, so the first one won't work in Python. The second one should be fine, though.
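Both points can be checked with a short sketch. It assumes the standard-library `re` module and per-line matching with `re.search`; the names `TAIL`, `LINE_LOOKAHEAD`, and the sample variables are made up for the example.

```python
import re

# Shared tail of both patterns, copied from the question.
TAIL = r"[Bb]uild [Vv]ersion:*\s*(?!.*<)'?([^']*)'"

# First pattern: the lookbehind contains .*, so its width is not fixed
# and the re module refuses to compile it.
try:
    re.compile(r".*(?<!Step Dialog.*)" + TAIL)
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = exc

print(lookbehind_error is not None)  # True

# Second pattern: one lookahead at the start of the line rejects any
# line that contains "Step Dialog" anywhere.
LINE_LOOKAHEAD = re.compile(r"^(?!.*Step Dialog).*" + TAIL)

template_line = (
    "02:24:50.655 - INFO   - gui: Step Dialog: For test results management "
    "purposes, specify the build in which the test is executed in the "
    "following format, build version: 'specify version here'"
)
comments_line = (
    "02:25:04.905 - INFO   - gui:     Comments: 'build version: '1.11.11''"
)

print(LINE_LOOKAHEAD.search(template_line))            # None
print(LINE_LOOKAHEAD.search(comments_line).group(1))   # 1.11.11
```

One difference from Chriszuma's pattern: the line-level lookahead rejects a line with Step Dialog anywhere in it, while the per-character version only rejects it when Step Dialog appears before build version. For the sample lines in the question the result is the same.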


answered Oct 14 '11 at 15:04

Justin Morgan - On strike


3

the first of these patterns will give a compilation error because the look-behind is not fixed-width. – ekhumoro Oct 14 '11 at 15:17
@ekhumoro - Good catch. I forgot about Python's distaste for variable-width lookbehinds. Edited – Justin Morgan - On strike Oct 14 '11 at 16:05
Why was this downvoted? If it's still incorrect, please explain why. – Justin Morgan - On strike Mar 30 '13 at 16:50
