
As a learning exercise, I like to compare two regular expressions doing the same thing.

In this case, I want to extract the sequences of numbers from strings like this:

CC_nums=[
'2341-3421-5632-0981-009',
'4521-9085-3948-2543-89-9'
]

And the correct result after capturing in a regex will be

['2341', '3421', '5632', '0981', '009']
['4521', '9085', '3948', '2543', '89', '9']

I understand that this works in Python:

import re

for number in CC_nums:
    print(re.findall(r'(\d+)', number))

But, to understand this more deeply, I tried the following:

for number in CC_nums:
    print(re.findall(r'\s*(?:(\d+)\D+)+(\d+)\s*', number))

...which returns:

[('0981', '009')]
[('89', '9')]

Two questions:

Firstly, why does the second one return a list of tuples instead of a list of strings? Secondly, why does the second one not match the other sets of digits, like 2341, 3421, etc.?

I know that findall will return non-overlapping capturing groups, so I tried to avoid this. The capturing groups are non-overlapping because of the (\d+), so I thought that this would not be an issue.

makansij

1 Answer


See "Python re.findall behaves weird" for why re.findall returns a list of tuples. In short, each match is returned as a tuple because the pattern contains more than one capturing group.
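As a minimal sketch of that rule, the snippet below runs re.findall with zero, one, and two capturing groups on the first string from the question. The two-group pattern `(\d+)-(\d+)` is made up for illustration only and is not taken from the question.

```python
import re

s = '2341-3421-5632-0981-009'

# No capturing group: each result is the whole match, as a string.
no_group = re.findall(r'\d+', s)

# One capturing group: each result is that group's text, as a string.
one_group = re.findall(r'(\d+)', s)

# Two capturing groups (illustrative pattern): each result is a tuple
# holding one item per group.
two_groups = re.findall(r'(\d+)-(\d+)', s)

print(no_group)    # ['2341', '3421', '5632', '0981', '009']
print(one_group)   # ['2341', '3421', '5632', '0981', '009']
print(two_groups)  # [('2341', '3421'), ('5632', '0981')]
```

The trailing `009` is missing from the last result because it is not followed by a hyphen and another number, so the two-group pattern cannot match there.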

The regex returns only the last two digit sequences because the + quantifier is applied to the (?:(\d+)\D+) group: each time this subpattern matches again, the new capture overwrites the previous one in the group buffer, so Group 1 ends up holding only the value from the final iteration.
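To see the overwriting in action, here is a small sketch that applies the pattern from the question with re.match and compares the full match with the stored groups. The variable names are illustrative.

```python
import re

s = '2341-3421-5632-0981-009'
pattern = re.compile(r'\s*(?:(\d+)\D+)+(\d+)\s*')

m = pattern.match(s)

whole = m.group(0)   # everything the single match consumed
groups = m.groups()  # only the final value stored in each group

print(whole)   # 2341-3421-5632-0981-009
print(groups)  # ('0981', '009')

# To collect every number, match one number per match instead of
# repeating a capturing group inside a single match.
all_numbers = re.findall(r'\d+', s)
print(all_numbers)  # ['2341', '3421', '5632', '0981', '009']
```

The pattern matches the whole string once, so re.findall has only one match to report, and that match carries just the last value of Group 1 plus Group 2.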

Wiktor Stribiżew
  • OK thank you, so if I understand correctly, this is a more elaborate explanation of what you're trying to tell me? http://www.regular-expressions.info/captureall.html – makansij Mar 02 '16 at 07:45
  • Yes, *the repeated capturing group will capture only the last iteration, while a group capturing another group that's repeated will capture all iterations.* I tried to use simpler words in a more concise way. Do you want me to add a step-by-step description of what is happening? – Wiktor Stribiżew Mar 02 '16 at 07:48
  • A step-by-step might help, but here's what confuses me more: I thought that this wouldn't work: `re.findall('(\d+)+', number)` because it *repeats the capturing group*, as you said. But, for some reason, that works perfectly fine? It makes complete sense that this would work: `re.findall('((?:\d+)+)', number)`, because it *captures the repeated group*, but why `(\d+)+` also works is beyond me. – makansij Mar 05 '16 at 21:14
  • I do not know what you mean by "work perfectly fine". `(\d+)+` matches and captures 1+ occurrences of 1+ digits. However, this expression is very inefficient because it contains a nested quantifier. It should be written as `\d+`. – Wiktor Stribiżew Mar 05 '16 at 21:20
  • But I thought that a *repeated capturing group will capture only the last iteration*, as you said? It captures all of the groups, it seems. Try `for number in CC_nums: print(re.findall('(\d+)+', number))`, and you'll see what I mean – makansij Mar 05 '16 at 21:22
  • [It is correct and expected](http://ideone.com/637kBF): `(\d+)+` *matches and captures 1+ occurrences of 1+ digits*. The engine sees and seizes `2341` with `\d+`, places it in Group 1, tries to repeat that the second time, but finds `-` -> fail. So, Group 1 contains `2341`. Then comes the second match, and the third and so on. – Wiktor Stribiżew Mar 05 '16 at 21:30
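The step-by-step description in the last comment can be checked with re.finditer, which exposes each separate match. This is a sketch only: the `(\d)+` pattern at the end is an added contrast case, not something from the thread.

```python
import re

s = '2341-3421-5632-0981-009'

# Each run of digits is its own match; within a match the group
# iterates only once, because \d+ greedily takes the whole run.
matches = [(m.span(), m.group(0), m.group(1))
           for m in re.finditer(r'(\d+)+', s)]
for item in matches:
    print(item)

# Contrast (illustrative): with a single \d inside the group, the
# group iterates once per digit and keeps only the last digit.
last_only = re.match(r'(\d)+', '2341').group(1)
print(last_only)  # 1
```

So `(\d+)+` appears to "capture everything" only because re.findall reports five separate matches, each with a single iteration of the group. The last-iteration rule still applies within any one match.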