Python - odd regex matching with + / * on group

Question

>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s*(\w+\.)+', src).groups()
('submod.',)

This regex looks as if it should put everything that is not whitespace into the group, with nothing lost before the match ends.

Why is only the last "+" repetition found in the group here, and not ('pkg.subpkg.submod.',)?

Or why not ('pkg.',), stopping early because nothing is really repeated, which would avoid "loss of information" in another sense?

(As a workaround I had to wrap the repetition in an extra group and make the inner one non-capturing with (?:...), as in r'\s((?:\w+\.)+)'.)

Even more strange:

>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s(\w+\.)*', src).groups()
(None,)

Edit: the "more strange" case is actually "less strange", as @Avinash Raj pointed out: contrary to what I intended, the match simply ends before the group. So

>>> re.search(r'\s+(\w+\.)*', '  pkg.subpkg.submod.thing').groups()
('submod.',)

... then shows the same behavior I am asking about for "+": only the last repetition is captured, and everything before it seems lost.

See [Capturing repeating subpatterns in Python regex](http://stackoverflow.com/questions/9764930/capturing-repeating-subpatterns-in-python-regex). — Wiktor Stribiżew, Apr 27 '17 at 10:22

Avinash Raj · Answer 1 · 2017-04-27T10:34:53.803

1

I'll explain the "even more strange" part.

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

re.search stops at the first position where the pattern matches. So

r'\s(\w+\.)*' matches the first space character. Because * repeats the preceding group zero or more times, and nothing right after that first space matches \w+\. (the next character is another space), the group takes part in zero repetitions. groups() on the match object therefore returns (None,), while group() returns the matched text, which is that first space.
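A minimal sketch to check this in the interpreter, using the src string from the question (the names m, whole, span and groups are only for illustration):

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m = re.search(r'\s(\w+\.)*', src)

# The overall match is only the first space of the string.
whole = m.group(0)   # ' '
span = m.span()      # (0, 1)

# The group took part in zero repetitions, so it is unset.
groups = m.groups()  # (None,)
```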

edited Apr 27 '17 at 10:34

answered Apr 27 '17 at 10:29

Avinash Raj


Oh, yes - in a hurry I intended something like `>>> re.search(r'\s(\w+\.)*', ' pkg.subpkg.submod.thing').groups() ('submod.',)` (the group actually gets something) which then yields the same thing as "+" – kxr Apr 27 '17 at 14:22
@kxr The reason for the extra capturing group is to capture all the repetitions. Take this example: `(a)+` on the input string `aaa` matches all the `a`s but captures only the last `a`, because you are repeating the capturing group itself one or more times. To capture them all you have to put the plus inside the capturing group: `(a+)`. Hope that clears it up. – Avinash Raj Apr 27 '17 at 14:25
`a` from `(a)+` and consuming up to the last possible `a` is more understandable because of identical repetitions - like in `re.search(r'(a)(\1)', 'aaabb').groups()`. But here there seems to be sort of "information loss" – kxr Apr 27 '17 at 15:32

am2 · Answer 2 · 2017-04-27T11:14:49.370

I do not know why this seems strange to you. What did you expect?

In the documentation you find the following:

re.search(pattern, string, flags=0) Scan through string looking for the first location where the regular expression pattern ...

re.search(r'\s*(\w+\.)+', src).groups()

Your pattern has only one capturing group: (\w+\.). Because + is greedy by default, the group first matches pkg., then subpkg., and finally submod.; each repetition overwrites the previous capture, so the group ends up holding only the last repetition that took part in the match.
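One way to see this, sketched against the src string from the question: the overall match spans every repetition, while group 1 only reports where its last repetition matched.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m = re.search(r'\s*(\w+\.)+', src)

# The whole match covers the leading spaces and all three repetitions.
whole = m.group(0)      # '  pkg.subpkg.submod.'

# Group 1 was overwritten on each repetition; only the last one remains.
last = m.group(1)       # 'submod.'
last_span = m.span(1)   # (13, 20)
```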

Your second attempt does match, but * does not require even one repetition of the group to satisfy the pattern: \s matches the first space, the next character is another space, so the group repeats zero times and you find nothing (None) inside it.

Are you looking for this?

re.search(r'\s*((\w+\.)+)', src).groups()[0]
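With the extra outer group, group 1 holds the full dotted prefix while group 2 still holds only the last repetition. If the individual components are needed, one possible approach (an assumption that a second pass over the captured text is acceptable) is re.findall:

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m = re.search(r'\s*((\w+\.)+)', src)

full = m.group(1)   # 'pkg.subpkg.submod.'
last = m.group(2)   # 'submod.' (inner group: last repetition only)

# Second pass: list every repetition inside the captured prefix.
parts = re.findall(r'\w+\.', full)   # ['pkg.', 'subpkg.', 'submod.']
```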

Try out the following to understand it better:

re.search(r'\s*((\w+\.)*)(\w+\.)*', 'a.b.c.d.e.f.g.h.i').groups()
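For reference, a sketch of what this exercise shows when run (the variable names are only for illustration):

```python
import re

m = re.search(r'\s*((\w+\.)*)(\w+\.)*', 'a.b.c.d.e.f.g.h.i')
groups = m.groups()

# Group 1 (outer) keeps everything its inner repetition consumed.
# Group 2 (inner, repeated) keeps only its last repetition.
# Group 3 finds nothing left to match ('i' has no trailing dot), so it is None.
outer, inner_last, leftover = groups
```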

score -1 · Answer 3 · answered Apr 27 '17 at 10:27


This should work fine to match the complete string ' pkg.subpkg.submod.thing pkg2.subpkg.submod.thing '

(\s*(\w+[.\s])+)+

If you want the output ' pkg.subpkg.submod.thing ' instead, use this:

\s*(\w+[.\s])+
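A quick check of both patterns against the src string from the question, as a sketch. Note that the repeated groups here also report only their last repetition, and that the first pattern stops one character short of the full string, since no word follows the final space:

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

m_all = re.search(r'(\s*(\w+[.\s])+)+', src)
m_first = re.search(r'\s*(\w+[.\s])+', src)

# Overall matches: use group(0), not the repeated groups.
whole_all = m_all.group(0)      # src without its final trailing space
whole_first = m_first.group(0)  # '  pkg.subpkg.submod.thing '

# The repeated groups again hold only their last repetition.
groups_all = m_all.groups()     # (' pkg2.subpkg.submod.thing ', 'thing ')
groups_first = m_first.groups() # ('thing ',)
```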

answered Apr 27 '17 at 10:27

deathNode


should help nevertheless, no? Since he seemed to be struggling with 2 tries – deathNode Apr 27 '17 at 10:35
