Python regex: overlapping patterns

Question

Suppose I have a string:

string = 'AvBvC'

I want to match A, B, and C, and this is what I did:

import re
match = re.search('(.*)v(.*)', string)
print(match.groups())

The problem is, the result shows that:

('AvB', 'C')

instead of what I want, which is

('A', 'B', 'C',)

How do I make it catch all overlapping patterns?

Thanks.

(I know there are some posts concerning the same issue, but haven't found a definite answer for Python)

@PeterDeGlopper Yeah!! just modified my post (my original code is way longer and more complicated than this,, sorry) — user2492270, Dec 08 '13 at 04:04
More details would help - you can do what the above question asks with just `split('v')`, so I can only assume you have a more complicated situation. — Peter DeGlopper, Dec 08 '13 at 04:06
This may be what you are after? http://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches — OllyTheNinja, Dec 08 '13 at 04:06
@PeterDeGlopper http://stackoverflow.com/questions/20449800/python-regex-nested-parenthesis — user2492270, Dec 08 '13 at 04:13
If I remember my computing theory correctly, standard regular expressions can't parse nested parens. Some languages have recursion extensions to their regexps but python does not. — Peter DeGlopper, Dec 08 '13 at 04:33
By default, regular expressions are greedy, so they will try to match as much as possible. Hence, `.*v.*`, which matches any run of characters, will match `('AvB', 'C')`. Please read the entirety of http://docs.python.org/2/library/re.html — IceArdor, Dec 08 '13 at 04:40
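
To make the greediness point from the comments concrete, here is a minimal sketch using the question's own string. The names `greedy` and `lazy` are only illustrative, and the non-greedy pattern `(.*?)v(.*)` is added for comparison; it does not appear in the original post.

```python
import re

string = 'AvBvC'

# Greedy: the first .* grabs the whole string, then backtracks only
# far enough to leave one 'v' for the rest of the pattern.
greedy = re.search(r'(.*)v(.*)', string).groups()

# Non-greedy: the first .*? stops at the earliest 'v' that lets
# the overall pattern succeed.
lazy = re.search(r'(.*?)v(.*)', string).groups()

print(greedy)  # ('AvB', 'C')
print(lazy)    # ('A', 'BvC')
```

Neither form returns three pieces: the pattern has only two groups, and `search()` returns a single match.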

hwnd · Answer 1 · 2013-12-08T04:43:42.663

2

Your question is somewhat unclear; you seem to have a more complicated string than the one you actually show.

search() matches only the first occurrence; use findall() to match all occurrences.

>>> re.findall(r'[^v]+', string)
['A', 'B', 'C']

Another option would be to split on whichever characters separate the parts you want.

>>> re.split('v', 'AvBvC')
['A', 'B', 'C']
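
For what it's worth, the two approaches agree on the sample string but can differ on other inputs. A small sketch, assuming the hypothetical inputs `'AAvBBBvC'` and `'AvvC'`, which are not from the question:

```python
import re

text = 'AAvBBBvC'  # hypothetical input with multi-character fields

tokens = re.findall(r'[^v]+', text)  # runs of non-'v' characters
parts = re.split('v', text)          # everything between 'v' delimiters

print(tokens)  # ['AA', 'BBB', 'C']
print(parts)   # ['AA', 'BBB', 'C']

# With two adjacent delimiters the results differ:
# findall skips the empty field, split keeps it.
empty_field = 'AvvC'  # hypothetical input
tokens_empty = re.findall(r'[^v]+', empty_field)
parts_empty = re.split('v', empty_field)

print(tokens_empty)  # ['A', 'C']
print(parts_empty)   # ['A', '', 'C']
```

Which one is preferable depends on whether empty fields carry meaning in the real data.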

edited Dec 08 '13 at 04:43

answered Dec 08 '13 at 04:03

hwnd


This works for my example, but my actual string contains things that are not single character such as A,B,C... So is there a way to do this while using "search" instead of "findall"?? – user2492270 Dec 08 '13 at 04:06
`search()` and `match()` return exactly 1 result. So, no. You need to make your question clearer ;-) – Tim Peters Dec 08 '13 at 04:07
@TimPeters Why doesn't negative lookahead allow me to match the beginning of the string? – thefourtheye Dec 08 '13 at 04:09
@thefourtheye, huh? Without some context (or code), I don't know what you're asking - sorry. – Tim Peters Dec 08 '13 at 04:10
@TimPeters `print re.findall("(?<=v|^).*?(?=v|$)", myString)` I tried this and it throws `sre_constants.error: look-behind requires fixed-width pattern` – thefourtheye Dec 08 '13 at 04:11
@hwnd http://stackoverflow.com/questions/20449800/python-regex-nested-parenthesis – user2492270 Dec 08 '13 at 04:12
@thefourtheye, that's look-behind, not look-ahead ;-) The error msg explained it: all alternatives in a lookbehind must match the same number of characters. `v` matches 1 character but `^` matches 0 characters. That's all there is to it. – Tim Peters Dec 08 '13 at 04:13
@TimPeters Some people refer to that as negative look-ahead ;) If that is the case, why does look-ahead accept `$`? – thefourtheye Dec 08 '13 at 04:15
@thefourtheye, then some people are bound to get confused by sloppy terminology ;-) The "fixed width" restriction is unique to look-behinds - it does not apply to look-aheads. The reason is simply the sheer difficulty of implementing varying-width look-behinds; they're "highly unnatural" for a "left to right" search engine. – Tim Peters Dec 08 '13 at 04:17
@TimPeters Cool :) Thanks :) BTW, you mind [joining us](http://chat.stackoverflow.com/rooms/6/python) sometime – thefourtheye Dec 08 '13 at 04:19
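
The look-behind restriction discussed in these comments can be reproduced directly. The sketch below assumes the sample string `'AvBvC'` for `my_string`. The rewritten pattern, which moves the alternation outside the look-behind, is one possible workaround and was not proposed in the thread.

```python
import re

my_string = 'AvBvC'  # assumed sample input

# (?<=v|^) mixes a 1-character alternative ('v') with a 0-character
# one ('^'), so the look-behind is rejected at compile time.
try:
    re.compile(r'(?<=v|^).*?(?=v|$)')
    compiled = True
except re.error:
    compiled = False

print(compiled)  # False

# Possible workaround: keep the look-behind fixed-width and put the
# alternation around it instead of inside it.
pattern = r'(?:(?<=v)|^).*?(?=v|$)'
fields = re.findall(pattern, my_string)

print(fields)  # ['A', 'B', 'C']
```

The look-ahead `(?=v|$)` needs no such change, since the fixed-width rule applies only to look-behinds.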

score 2 · Answer 2 · answered Dec 08 '13 at 04:37

Use re.split

>>> import re
>>> re.split('v', 'AvBvC')
['A', 'B', 'C']

And to demonstrate further...

>>> re.split('vw', 'AAvwBBvwCC')
['AA', 'BB', 'CC']
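
A few more variations on `re.split`, as a rough sketch. The input strings containing the `xy` delimiter and the variable names are made up for illustration; `maxsplit` and the capturing-group behaviour are standard features of `re.split`.

```python
import re

# Several different delimiters can be combined with alternation.
multi = re.split(r'vw|xy', 'AAvwBBxyCC')
print(multi)  # ['AA', 'BB', 'CC']

# A capturing group around the delimiter keeps it in the result.
kept = re.split(r'(vw)', 'AAvwBB')
print(kept)  # ['AA', 'vw', 'BB']

# maxsplit limits how many splits are performed.
first_only = re.split('vw', 'AAvwBBvwCC', maxsplit=1)
print(first_only)  # ['AA', 'BBvwCC']
```

For a fixed delimiter with no regex syntax, plain `str.split` gives the same result without the `re` module.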

FogleBird

