Capturing repeating subpatterns in Python regex

Question

While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only includes .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

Capturing repeated expressions was proposed in [Python Issue 7132](https://bugs.python.org/issue7132) but rejected. It is however supported by the third-party [regex](https://pypi.org/project/regex/) module. — Todd Owen, Oct 15 '18 at 00:27
@ToddOwen But, isn't this now possible in 2.7? I don't know when it became possible. But, the answer from https://stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module. — Michael Ohlrogge, Nov 25 '18 at 00:22
@MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are _inside_ a repeat. The issue is not fixed, and will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses _around_ a repeating pattern. (Note that `(?: ...)` are not capturing parentheses). — Todd Owen, Nov 28 '18 at 21:36
@ToddOwen Got it, thank you, that is a helpful clarification! — Michael Ohlrogge, Nov 29 '18 at 01:03

score 40 · Answer 1 · edited May 23 '17 at 12:09

The `re` module doesn't support repeated captures, but the third-party `regex` module does:

>>> import regex  # third-party module (pip install regex), not the standard re
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I'd go with splitting the repeated subpatterns later. It leads to simple, readable code; see, for example, the code in @Li-aung Yip's answer.
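A minimal sketch of the "match everything, split later" approach using only the standard `re` module. The helper name `split_address` and the simplified pattern are assumptions made for illustration; this is not a real email validator.

```python
import re

# Simplified pattern based on the question: local part, "@", first label,
# then one or more ".label" parts captured together as a single group.
ADDRESS = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def split_address(address):
    """Return (local, first_label, dotted_parts), or None if there is no match."""
    m = ADDRESS.fullmatch(address)
    if m is None:
        return None
    local, first_label, tail = m.groups()
    # Recover each repetition from the captured tail in a second pass.
    return local, first_label, re.findall(r'\.\w+', tail)

print(split_address('yasar@webmail.something.edu.tr'))
# ('yasar', 'webmail', ['.something', '.edu', '.tr'])
```

The first pass anchors the match to the whole address, so the second pass only ever sees text that has already been validated against the pattern.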

answered Mar 19 '12 at 05:22

jfs


Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of `\1`, `\2`, `\3` etc. change depending on how many times you matched `(\.\w+)`? – Li-aung Yip Mar 19 '12 at 07:55
@Li-aung Yip: `\1` corresponds to `m.group(1)`; the meaning hasn't changed. You could use a function as a replacement pattern and call `m.captures()` in it. – jfs Mar 19 '12 at 09:03
In your example, the meaning of `\1`, `\2`, and `\3` is obvious because they only capture once. But what is the meaning of `\4`, corresponding to `(\.\w+)+`? `\4` appears to be "the last substring matched by the 4th capture group", in this case `.tr`. – Li-aung Yip Mar 19 '12 at 09:12
@Li-aung Yip: `m.groups()` above explicitly shows what `\4` is. – jfs Mar 19 '12 at 09:13
The meaning hasn't changed: `\4` is `m.group(4)` whatever it is. – jfs Mar 19 '12 at 09:21
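The standard `re` module has no `m.captures()`, but the idea of using a function as the replacement can still be sketched by capturing the whole repetition in one group and splitting it inside the callback. The function name `reverse_labels` and the transformation it performs are purely illustrative assumptions.

```python
import re

# Group 3 wraps a non-capturing repeat, so it holds every ".label" part.
PATTERN = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def reverse_labels(m):
    # Split the captured run back into its individual repetitions.
    parts = re.findall(r'\.\w+', m.group(3))
    # Illustrative transformation: emit the dotted parts in reverse order.
    return m.group(1) + '@' + m.group(2) + ''.join(reversed(parts))

result = PATTERN.sub(reverse_labels, 'mail yasar@webmail.something.edu.tr now')
print(result)
# mail yasar@webmail.tr.edu.something now
```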

score 13 · Answer 2 · answered Mar 19 '12 at 04:28

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
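A short sketch of the difference, using the address from the question and a simplified pattern for the part before the repeat (that prefix is an assumption, not part of this answer):

```python
import re

address = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: each iteration overwrites the previous one.
inner = re.match(r'[.\w]+@\w+(\.\w+)+', address)
print(inner.group(1))   # .tr

# Capturing group around a non-capturing repeat: the whole run is kept.
outer = re.match(r'[.\w]+@\w+((?:\.\w+)+)', address)
print(outer.group(1))   # .something.edu.tr

# The individual parts still have to be split out afterwards.
labels = outer.group(1).lstrip('.').split('.')
print(labels)           # ['something', 'edu', 'tr']
```

Note that the outer group returns the repetitions as one string, so a separate split step is still needed to get them individually.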

answered Mar 19 '12 at 04:28

Taymon


For abbreviations (if you've lower-cased): `re.sub(ur'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)` – scharfmn Aug 15 '15 at 09:58
Thanks. Adding parentheses allowed me to match a repeated subpattern, but then the match contained a group holding only the last repetition of the pattern. I hadn't seen that `(?: ...)` makes a non-capturing group. https://docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes the problem. – Tim Swast Jul 21 '16 at 22:22
this doesn't split the groups – Jules G.M. Nov 18 '22 at 21:26

score 13 · Answer 3 · edited May 23 '17 at 11:46

This will work:

>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But this is limited to a maximum of six domain labels (the first one plus five optional dotted parts). A better way to do this would be:

>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this pattern will fail on. See this question for a detailed treatment of email address regexes.

score 4 · Answer 4 · answered Oct 04 '17 at 18:22

This is what you are looking for:

>>> import re
>>> s = "yasar@webmail.something.edu.tr"
>>> r = re.compile(r"\.\w+")  # raw string avoids the invalid-escape warning
>>> m = r.findall(s)
>>> m
['.something', '.edu', '.tr']

answered Oct 04 '17 at 18:22

Tushar Vazirani


This doesn't match the `yasar@webmail` part. As such, it could easily pick up false positives wherever something other than an email address contains multiple period-separated parts. – Michael Ohlrogge Nov 24 '18 at 18:07
OP has clearly written that this is just an example and what he is trying to do is more complicated. Hence, my answer. – Tushar Vazirani Nov 24 '18 at 18:09
Yes, but the problem is that your solution won't work even on the simplified version of the problem OP gave. Your solution is trivially simple for anyone with even the most basic understanding of RegEx. All other answers are more complicated because this is a genuinely non-trivial problem to solve. – Michael Ohlrogge Nov 24 '18 at 18:31
