In Python, how do you capture a group within a non-capturing group? Put another way, how do you repeat a non-capturing sub-pattern that contains a capturing group?

An example of this would be capturing all of the package names in an import statement. For example, the string:

import pandas, os, sys

would return 'pandas', 'os', and 'sys'. The following pattern captures the first package and matches up to the start of the second package:

import\s+([a-zA-Z0=9]*),*\s*

From here, I would like to repeat the sub-pattern that captures the group and matches the characters that follow it, i.e. ([a-zA-Z0=9]*),*\s*. When I surround this sub-pattern with a non-capturing group and repeat it:

import\s+(?:([a-zA-Z0=9]*),*\s*)*

It no longer captures the group inside.
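
To make the symptom concrete, here is a minimal sketch using only the standard re module. The pattern is a simplified stand-in for the one above (\w+ in place of the character class, which is an assumption made for illustration), and the names s, m and last_only are hypothetical. It shows that the inner group still exists after the repetition but holds only one value:

```python
import re

s = "import pandas, os, sys"

# Simplified stand-in for the pattern above: \w+ replaces the character class.
m = re.match(r"import\s+(?:(\w+),?\s*)+", s)

# Group 1 took part in three iterations, but only the final one is kept.
last_only = m.group(1)
print(last_only)  # sys
```

The earlier iterations ('pandas' and 'os') are matched but are not retrievable from the match object.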

Bryce93
  • If you want that functionality, use the PyPI regex module. – Wiktor Stribiżew Sep 09 '16 at 17:36
  • The issue is not capturing versus non-capturing groups; the issue is trying to get an _unset_ number of values for further use. Using `*` on capturing groups will hardly ever yield the results you're looking for, and this is not something regex is generally used for. Instead, the sensible thing would be to match the whole list of imported packages and then split the string by `,\s*(?=\w)` or something like that. – Andris Leduskrasts Sep 09 '16 at 17:56
  • Does this answer your question? [How can I repeat a capturing group one or more time and extract matches](/a/33843279/90527) – outis Sep 01 '22 at 08:48
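
The split-based approach suggested in the comments above could look roughly like this. It is a sketch only: the function name imported_names is hypothetical, and it assumes a single-line statement of the plain `import a, b, c` form with no `as` clauses.

```python
import re

def imported_names(line):
    """Return the names listed after 'import', or [] if the line does not match."""
    # Step 1: capture everything after 'import' as one group.
    m = re.match(r"import\s+(.+)", line)
    if m is None:
        return []
    # Step 2: split on commas, using the separator pattern from the comment.
    return re.split(r",\s*(?=\w)", m.group(1).strip())

print(imported_names("import pandas, os, sys"))  # ['pandas', 'os', 'sys']
```

This avoids the repeated group entirely: one capture for the whole list, then an ordinary split.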

3 Answers

Your question is phrased strictly about regex, but if you're willing to use a recursive descent parser (e.g., pyparsing), many things that require expertise in regex become very simple.

For example, what you're asking for becomes:

from pyparsing import Literal, Suppress, commaSeparatedList

p = Suppress(Literal('import')) + commaSeparatedList

>>> p.parseString('import pandas, os, sys').asList()
['pandas', 'os', 'sys']

>>> p.parseString('import                    pandas,             os').asList()
['pandas', 'os']

It might be a matter of personal taste, but to me,

Suppress(Literal('import')) + commaSeparatedList

is also more intuitive than a regex.

Ami Tavory

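A repeated capturing group only keeps its last iteration. This is why you need to restructure your regex so that each imported name is a separate match that re.findall can return. The pattern below is laid out in verbose form (whitespace and # comments, as accepted with the re.VERBOSE flag):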

\s*
(?:
  (?:^from\s+
    (  # Base (from (base) import ...)
      (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
        (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
      )
    )\s+import\s+
  )
|
  (?:^import\s|,)\s*
)
(  # Name of imported module (import (this))
  (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
    (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
  )
)
(?:
  \s+as\s+
  (  # Variable module is imported into (import foo as bar)
    (?:[a-zA-Z_][a-zA-Z_0-9]*  # Variable name
      (?:\.[a-zA-Z_][a-zA-Z_0-9]*)*  # Attribute (.attr)
    )
  )
)?
\s*
(?=,|$)  # Ensure there is another thing being imported or it is the end of string

Each tuple returned by re.findall has three items: index 0 is the base (group 1 of the pattern), index 1 is the name of the imported module (group 2, which is what you're after), and index 2 is the name the module is bound to (group 3), as in: from (base) import (module) as (alias).

import re

regex = r"\s*(?:(?:^from\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))\s+import\s+)|(?:^import\s|,)\s*)((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*))(?:\s+as\s+((?:[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*)))?\s*(?=,|$)"

print(re.findall(regex, "import pandas, os, sys"))
# Output: [('', 'pandas', ''), ('', 'os', ''), ('', 'sys', '')]

You can remove the other two capturing groups if you don't need them.
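
As a rough sketch of that reduction, the version below keeps only the module-name group and drops the `from ... import` branch. The names NAME and IMPORT_NAME are hypothetical, and the pattern assumes a single-line plain `import` statement:

```python
import re

# A dotted identifier such as os.path (same sub-pattern as in the regex above).
NAME = r"[a-zA-Z_][a-zA-Z_0-9]*(?:\.[a-zA-Z_][a-zA-Z_0-9]*)*"

IMPORT_NAME = re.compile(
    rf"""
    (?:^import\s|,)\s*   # start of the statement, or a comma between modules
    ({NAME})             # the only capturing group: the module name
    (?:\s+as\s+{NAME})?  # optional alias, matched but not captured
    \s*
    (?=,|$)              # another module follows, or the string ends
    """,
    re.VERBOSE,
)

print(IMPORT_NAME.findall("import pandas, os, sys"))  # ['pandas', 'os', 'sys']
```

With a single capturing group, re.findall returns a flat list of strings instead of a list of tuples.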

Artyer

You can use your import\s+(?:([a-zA-Z0-9=]+),*\s*)* regex (I only fixed the 0-9 range so that it matches any digit and moved = to the end of the character class) and access the Group 1 capture stack using the PyPI regex module:

>>> import regex
>>> s = 'import pandas, os, sys'
>>> rx = regex.compile(r'^import\s+(?:([a-zA-Z0-9=]+),*\s*)*$')
>>> print([x.captures(1) for x in rx.finditer(s)])
[['pandas', 'os', 'sys']]
Wiktor Stribiżew