Regular expression negative lookbehind of non-fixed length

Question

As the documentation says:

This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length.

So this works. The intention is to match any , outside {} but none inside {}:

In [188]:

re.compile(r"(?<!\{)\,.").findall('a1,a2,a3,a4,{,a6}')
Out[188]:
[',a', ',a', ',a', ',{']

This also works, on a slightly different input:

In [189]:

re.compile(r"(?<!\{a5)\,.").findall('a1,a2,a3,a4,{a5,a6}')
#or this: re.compile(r"(?<!\{..)\,.").findall('a1,a2,a3,a4,{a5,a6}')
Out[189]:
[',a', ',a', ',a', ',{']

But if the input is 'a1,a2,a3,a4,{_some_length_not_known_in_advance,a6}', then according to the documentation the following won't work as intended:

In [190]:

re.compile(r"(?<![\{.*])\,.").findall('a1,a2,a3,a4,{a5,a6}')
Out[190]:
[',a', ',a', ',a', ',{', ',a']

Any alternative to achieve this? Is negative lookbehind the wrong approach?

Is there a reason lookbehind was designed this way (matching only strings of some fixed length) in the first place?

score 12 · Accepted Answer · edited May 23 '17 at 12:00

Any alternative to achieve this?

Yes. There is a brilliantly simple technique, and this situation is very similar to "regex-match a pattern unless..."

Here's your simple regex:

{[^}]*}|(,)

The left side of the alternation | matches complete { brackets } tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.

Here is a demo that performs several tasks, so you can pick and choose (see the output at the bottom of the demo):

1. Count the commas you want to match (not those between braces)
2. Show the matches (commas... duh)
3. Replace the right commas. Here we replace with SplitHere so we can perform task 4...
4. Split on the commas, and display the split strings
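The demo itself is not reproduced in this text, so here is a minimal sketch of the four tasks using only the standard `re` module. The variable names and the `mark` helper are illustrative assumptions, not the original demo's code, and the braces are escaped here for readability even though the pattern above leaves them bare. It assumes the braces are balanced and not nested.

```python
import re

subject = 'a1,a2,a3,a4,{a5,a6}'

# Left branch swallows whole {...} groups; right branch captures a comma in Group 1.
pattern = re.compile(r'\{[^}]*\}|(,)')

# Tasks 1 and 2: keep only the matches in which Group 1 took part.
commas = [m.group(1) for m in pattern.finditer(subject) if m.group(1) is not None]
count = len(commas)

# Task 3: replace captured commas, put brace groups back unchanged.
def mark(m):  # hypothetical helper name
    return 'SplitHere' if m.group(1) is not None else m.group(0)

replaced = pattern.sub(mark, subject)

# Task 4: split on the marker inserted in task 3.
pieces = replaced.split('SplitHere')

print(count)     # 4
print(commas)    # [',', ',', ',', ',']
print(replaced)  # a1SplitHerea2SplitHerea3SplitHerea4SplitHere{a5,a6}
print(pieces)    # ['a1', 'a2', 'a3', 'a4', '{a5,a6}']
```

The key point is that every task inspects Group 1 rather than the overall match, which is why a bare `findall` on this pattern returns an extra empty string for the brace group.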

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

answered Jun 07 '14 at 03:57

zx81


Wonderful, I felt I was on the wrong track. Let me try it out. Any idea why negative lookbehind only works for fixed length by design? – CT Zhu Jun 07 '14 at 04:01
@CTZhu I added a full Python program that counts the right commas, shows them, replaces them, and splits the string. :) So that shows you how to do all the main things with this technique. – zx81 Jun 07 '14 at 04:06
@CTZhu `Any idea why negative lookbehind only works for fixed length by design?` That actually depends on your regex engine. In .NET you can have infinite-width lookbehind... and also in Python!!! But only if you use Matthew Barnett's alternate (and awesome) [`regex`](https://pypi.python.org/pypi/regex) module. As to why... sure, it's more work and chances for catastrophic backtracking, esp in the old days of small RAM. :) An old-school workaround is to reverse the string and use a lookahead. – zx81 Jun 07 '14 at 04:09
`re.compile("{[^}]*}|(,)").findall('a1,a2,a3,a4,{a5,a6}')` actually returns an extra `''`: `[',', ',', ',', ',', '']`. Any chance I can get rid of it just via `re`? Thanks for the tip about `regex`, I will take a close look at it. – CT Zhu Jun 07 '14 at 04:12
@CTZhu YES but you need to look closely at the demo I sent you, as you are not using the regex correctly. The demo inspects Group 1. The point is not to use `findall` directly. As the explanation mentions, we don't care about the matches on the left of the regex, so we need to inspect Group 1 captures. – zx81 Jun 07 '14 at 04:16
Thanks, actually I just realized it when I looked at the code you posted and was about to delete the previous comments. Perfect! Really appreciated. I only use `re` a few times a year so that part is very rusty on my side. – CT Zhu Jun 07 '14 at 04:19
@CTZhu Also since you like the trick, I highly recommend you take a look at (or save for later) the [linked question about exclusions in regex patterns](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#23589204), I had a lot of fun writing it. :) – zx81 Jun 07 '14 at 04:25

hwnd · Answer 2 · 2014-06-07T05:24:36.493

3

Instead of a negative lookbehind, you can use a negative lookahead, provided the braces are balanced.

,(?![^{]*\})

For example:

>>> re.findall(r',..(?![^{]*\})', 'a1,a2,a3,a4,{_some_unknown_length,a5,a6,a7}')
[',a2', ',a3', ',a4']
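To match only the commas themselves (without the two trailing characters used in the example above), the bare pattern can be passed to `finditer` or `split`. This is a small sketch under the assumption that braces are balanced and never nested; the names `outside_comma`, `positions`, and `fields` are illustrative.

```python
import re

subject = 'a1,a2,a3,a4,{_some_unknown_length,a5,a6,a7}'

# A comma qualifies only if no '}' can be reached from it without crossing a '{' first.
outside_comma = re.compile(r',(?![^{]*\})')

# Index of every comma that lies outside the braces.
positions = [m.start() for m in outside_comma.finditer(subject)]

# Split on those commas only; the brace group stays in one piece.
fields = outside_comma.split(subject)

print(positions)  # [2, 5, 8, 11]
print(fields)     # ['a1', 'a2', 'a3', 'a4', '{_some_unknown_length,a5,a6,a7}']
```

Note that this version also finds the comma directly before `{`, which the `,..` variant skips because its two extra characters consume `{_`.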

edited Jun 07 '14 at 05:24

answered Jun 07 '14 at 04:40

hwnd


Thanks! A clever method of not matching any `,` followed by `[^{]*\}`. – CT Zhu Jun 07 '14 at 23:44
