
Below is the code:


import re

a = "111234567890"

# Pattern: one or more repetitions of "3 digits preceded by a digit"
b = re.finditer(r"((?<=\d)\d{3})+", a)
print("finditer: ")
print(b)
for item in b:
    print(item.group(), end='\n\n')

# Same pattern with findall
c = re.findall(r"((?<=\d)\d{3})+", a)
print("findall: ")
print(c, end='\n\n')

# Same pattern with search
d = re.search(r"((?<=\d)\d{3})+", a)
print("search: ")
print(d)

As shown above, I read the Python 3 docs and found that findall and finditer, like search, look for a match anywhere in the string, unlike match, which only checks at the beginning. So I used these three functions to compare, and running the code gave the result below:

finditer: 
<callable_iterator object at 0x000002712064F948>
112345678

findall: 
['678']

search: 
<re.Match object; span=(1, 10), match='112345678'>

Then my question is:

If my positive lookbehind pattern had no "+", i.e. r"((?<=\d)\d{3})", then each match should be 3 digits preceded by 1 digit (the lookbehind). Thus:

re.search result should be: 112
re.findall result should be: ['112', '345', '678']
re.finditer result should be: ['112', '345', '678']
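
The expectation above can be checked directly. This is a minimal sketch that reuses the string `a` from the question; the names `pattern` and `first` are only illustrative and do not appear in the original code.

```python
import re

a = "111234567890"
pattern = r"((?<=\d)\d{3})"  # the question's pattern without the trailing "+"

# search stops at the first match
first = re.search(pattern, a)
print(first.group())

# findall returns the single captured group for every non-overlapping match
print(re.findall(pattern, a))

# finditer yields one match object per match; group() is the whole match
print([m.group() for m in re.finditer(pattern, a)])
```

Note that `search` gives '112' here: index 0 has no digit behind it, so the first position where the lookbehind succeeds is index 1.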

But as my code above shows, I added '+' after r"((?<=\d)\d{3})", so the pattern becomes r"((?<=\d)\d{3})+". That means it should match this case (3 digits preceded by 1 digit) one or more times, so the result should combine 112, 345 and 678. As the output above shows:

re.search gives '112345678'.
re.finditer gives the same result.
However, re.findall gives a different result: ['678'].

Can anyone explain why adding '+' after the group makes re.findall return a different result from re.search and re.finditer?

Thanks a lot

  • I could be wrong. But I think the issue is "Where does the next search start?" When you call re.findall on a string and it finds a match, it starts the next search at the character following the match. Those characters are no longer available, not even for a ?<=. – Frank Yellin Jan 25 '22 at 20:49
  • The `+` _after_ the closing grouping parenthesis does weird things, in the `findall` case it can only return one captured group (because what else could a single pair of capturing parentheses possibly return?) and so it returns the last one after the `+` repeated as many times as possible. It's not clear that any of these behaviors is a bug, or isn't. Perhaps this is simply undefined behavior. – tripleee Jan 25 '22 at 21:09
  • If you examine the match objects from `finditer` and `match` I suspect you will find that they have a `group(1)` which contains exactly the same thing as in the `findall` case. – tripleee Jan 25 '22 at 21:15
  • @tripleee yes, finditer has 1 subgroup, the same as findall. Based on your explanation I draw 3 conclusions; can you check whether they are correct? **`1:`** only when `zero` or `one` pair of grouping parentheses exists do `findall` and `finditer` return the same result. **`2:`** if multiple pairs of grouping parentheses exist (without `+`), `search` returns the entire match; `findall` returns all parenthesized subgroups; `finditer` combines the two: `group()` of each match object from `finditer` is the same as the `search` result, and the remaining subgroups (`group(1)`, `group(2)`, ...) are the same as the subgroups from `findall`. – YYDonald Jan 26 '22 at 04:04
  • @tripleee **`3:`** if `+` is used after the grouping parentheses (which may be considered multiple grouping parentheses), the entire match should be the same as in `condition 2` above (multiple grouping parentheses exist), so the result of `search` is the same as `group()` of the match object from `finditer`. But for parenthesized subgroups, only the last one is kept, which means both `findall` and the match objects from `finditer` will have at most `only 1` parenthesized subgroup, which is the latest matched argument. – YYDonald Jan 26 '22 at 04:09
  • 1 and 2 look dubious. `search` merely tells you whether the regex matches but it will coincidentally also tell you which part matched. You can perfectly well have multiple parenthesized groups, but each group can only contain one match at a time. – tripleee Jan 26 '22 at 06:10
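
The points raised in the comments above can be checked with a minimal sketch. It assumes the same input string as the question. The non-capturing variant `(?:...)` is not part of the original code and is shown only as one possible way to make `findall` return the whole match. The last line uses a separate, hypothetical input to test whether a lookbehind can see characters consumed by an earlier match.

```python
import re

a = "111234567890"
capturing = r"((?<=\d)\d{3})+"
# Assumption: the same pattern rewritten with a non-capturing group
non_capturing = r"(?:(?<=\d)\d{3})+"

m = re.search(capturing, a)
whole = m.group(0)   # the entire match
last = m.group(1)    # a repeated group keeps only its last repetition
print(whole, last)

# With one capturing group, findall returns that group, not the whole match
with_group = re.findall(capturing, a)
# With no capturing group, findall returns the whole match
without_group = re.findall(non_capturing, a)
print(with_group, without_group)

# Hypothetical input: the second match needs the '4' consumed by the first
lookbehind_check = re.findall(r"(?<=\d)\d{3}", "1234567")
print(lookbehind_check)
```

If this runs as expected, `group(1)` of the match object equals the single element that `findall` returns, which suggests the difference comes from what `findall` reports (the group) rather than from where the next search starts.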
