0

Even though I've looked through multiple examples of how to capture a repeated group, I can't seem to understand it well enough to make it work for my case.

use ((?:([a-z_0-9]*)\.)*)(\w*);

Regex example

As shown in the above example, I wish to capture the package name in segments, so that use ieee.std_logic_1164.all; returns [IEEE, std_logic_1164, all], and so on. Currently my regex returns [ieee.std_logic_1164., std_logic_1164, all], which is wrong.

I am missing something in my expression. It seems that I can only capture the last iteration of the repeated group, even though I've wrapped the repeated section in another capturing group according to the specification.
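To make the behaviour concrete, here is a minimal sketch that runs the pattern above with Python's built-in re module (the choice of engine and the variable names are assumptions made for illustration). The inner group is overwritten on every repetition, so only its last iteration is kept.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"use ((?:([a-z_0-9]*)\.)*)(\w*);")

m = pattern.search("use ieee.std_logic_1164.all;")

# Group 1 is the whole repeated part, group 2 only the LAST iteration
# of the inner group, group 3 the final segment.
groups = m.groups()
print(groups)  # ('ieee.std_logic_1164.', 'std_logic_1164', 'all')
```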

Regards

The fourth bird
Ephreal
  • What's it supposed to return instead of `[IEEE, std_logic_1164, all]`? Or do you mean it does not capture the rest that came from words? So, what did you expect? – Artyom Vancyan Feb 28 '23 at 09:24
  • Just take the match and split it afterwards. – trincot Feb 28 '23 at 09:31
  • @ArtyomVancyan Looking at my example, I get the following captured: [ieee.std_logic_1164., std_logic_1164, all]. What I need is the list above. – Ephreal Feb 28 '23 at 09:40
  • @trincot Looking at the answers, this seems to be the best suggestion, and it works. But is it not possible to achieve this with regex? – Ephreal Feb 28 '23 at 09:47
  • @Ephreal, it's possible – RomanPerekhrest Feb 28 '23 at 09:50
  • Your question would be easier to follow if you placed what is in the link as text in your question. Flipping back and forth is inefficient and if the link ever breaks your question becomes unanswerable. – Cary Swoveland Feb 28 '23 at 10:05

4 Answers

3

Another option is the PyPI regex module, which keeps every capture of a repeated group. Use allcaptures() or capturesdict() together with capture groups that share the same name (name in this example).

If you want the first item in the result list to be upper cased, you can use .upper()

\buse (?:(?<name>\w+)\.)*(?<name>\w+);

See a regex demo with a different engine selected, but showing the same idea.

import regex as re

s = ("library ieee;\n"
"use ieee.std_logic_1164.all;\n"
"use ieee.numeric_std.all;\n"
"use work.pk_avalon_mm_extif_defs.all;\n"
"use work.common_tb_pkg.all;\n"
"package common_register_interface_pkg is")

pattern = r"\buse (?:(?P<name>\w+)\.)*(?P<name>\w+);"
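# finditer returns a one-shot iterator; store the matches in a list
# so that both loops below can iterate over them.
matches = list(re.finditer(pattern, s))

# allcaptures()[1] holds every capture of group 1 (the shared name group)
for m1 in matches:
    print(m1.allcaptures()[1])

# capturesdict() maps each group name to the list of its captures
for m1 in matches:
    print(m1.capturesdict()['name'])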

Both will output

['ieee', 'std_logic_1164', 'all']
['ieee', 'numeric_std', 'all']
['work', 'pk_avalon_mm_extif_defs', 'all']
['work', 'common_tb_pkg', 'all']
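As a minimal sketch of the .upper() remark, assuming one of the segment lists printed above as input (the variable names are only illustrative):

```python
# Assumed input: one segment list as printed above.
parts = ['ieee', 'std_logic_1164', 'all']

# Upper-case only the first segment and keep the rest unchanged.
upper_first = [parts[0].upper(), *parts[1:]]
print(upper_first)  # ['IEEE', 'std_logic_1164', 'all']
```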
The fourth bird
1

You cannot have a variable number of capture groups. So either you must have a separate capture group for each chunk you want to match (foreseeing enough of them), or else you match them all in one capture group, and split them later.

With the first idea you'd still have to identify, for each match, which capture groups came back empty and should therefore be ignored. This does not seem to give any benefit over the second idea.
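For illustration, a rough sketch of the first idea with the built-in re module might look like the following. The limit of four segments and the helper name segments are assumptions chosen for the example. Note that in re an optional group that did not take part in the match is reported as None rather than as an empty string.

```python
import re

# Hypothetical limit: at most four dot-separated segments per use clause.
PATTERN = re.compile(r"use (\w+)(?:\.(\w+))?(?:\.(\w+))?(?:\.(\w+))?;")

def segments(line):
    """Return the segments of a use clause, dropping unused groups."""
    m = PATTERN.search(line)
    if m is None:
        return []
    # Groups that did not participate in the match are None.
    return [g for g in m.groups() if g is not None]

print(segments("use ieee.std_logic_1164.all;"))  # ['ieee', 'std_logic_1164', 'all']
```

A clause with more segments than the pattern foresees does not match at all, which is the main drawback of this approach.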

Here is how the second idea would work:

import re

s = """-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

use work.pk_avalon_mm_extif_defs.all;
use work.common_tb_pkg.all;

package common_register_interface_pkg is"""

results = [m.split(".") for m in re.findall(r"use ((?:(?:[a-z_0-9]*)\.)*\w*);", s)]

print(results)
trincot
  • Your first paragraph omits the option of matching, but not capturing, the strings of interest, which is what @anubhava has done. Слава Україні! – Cary Swoveland Feb 28 '23 at 10:59
  • Героям слава! ... Yes, but it needs sticky matching (not supported by native `re`), and you need to implement logic in Python to combine the parts which belong to the same library (by detecting the match of "use "). – trincot Feb 28 '23 at 13:08
  • I know but a little Python, but I assumed all Pythoneers now use the regex module because it is much superior to the re module. No? Слава регулярним виразам! – Cary Swoveland Feb 28 '23 at 18:01
  • I have upvoted the other answer. I would not assume `regex` when asker has not mentioned it. The question could have been clearer... – trincot Feb 28 '23 at 18:04
1

I don't think your pattern needs groups. You can specify that the searched expression is preceded by use and followed by ; with lookarounds. Then just split your matches along . as suggested in a comment:

import re

text = """library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

use work.pk_avalon_mm_extif_defs.all;
use work.common_tb_pkg.all;

package common_register_interface_pkg is"""

matches = re.findall(r"(?<=use )(?:\w+\.)+\w*(?=;)", text)
res = [m.split('.') for m in matches]
print(res)

Output:

[['ieee', 'std_logic_1164', 'all'], ['ieee', 'numeric_std', 'all'], ['work', 'pk_avalon_mm_extif_defs', 'all'], ['work', 'common_tb_pkg', 'all']]
Tranbi
1

If you can install the PyPI regex module, you can do this in a single regex:

(?:^use\s+|(?!^)\G\.)\K\w+

RegEx Demo

RegEx Details:

  • (?:: Start non-capture group
    • ^: Match the start position (the start of each line in multiline mode)
    • use: Match use
    • \s+: Match 1+ whitespaces
    • |: OR
    • (?!^)\G: Start at the end position of the previous match
    • \.: Match a DOT
  • ): End non-capture group
  • \K: Reset match info
  • \w+: Our match that is 1+ word characters
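If the regex module is not available, the same "continue where the previous match ended" idea can be approximated with the built-in re module, because Pattern.match(string, pos) only matches at the given position. The following is a sketch under that assumption, not the pattern above itself. The helper name use_segments and the three small patterns are made up for the example, and the segments are grouped into one list per use clause.

```python
import re

USE_RE = re.compile(r"^use\s+", re.MULTILINE)  # start of a use clause
WORD_RE = re.compile(r"\w+")                   # one segment
DOT_RE = re.compile(r"\.")                     # separator between segments

def use_segments(text):
    """Return one list of dot-separated segments per use clause."""
    results = []
    for start in USE_RE.finditer(text):
        pos = start.end()
        parts = []
        while True:
            # match() with a pos argument is anchored at pos, similar to \G
            word = WORD_RE.match(text, pos)
            if not word:
                break
            parts.append(word.group())
            dot = DOT_RE.match(text, word.end())
            if not dot:
                break
            pos = dot.end()
        if parts:
            results.append(parts)
    return results

sample = "library ieee;\nuse ieee.std_logic_1164.all;\nuse work.common_tb_pkg.all;\n"
print(use_segments(sample))
# [['ieee', 'std_logic_1164', 'all'], ['work', 'common_tb_pkg', 'all']]
```

Like the single-regex version, this sketch does not check for the trailing semicolon.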
anubhava
  • What would be the Python code to combine the matched segments in lists? – trincot Feb 28 '23 at 13:16
  • @trincot: That would be: `regex.findall(r'(?m)(?:^use\s+|(?!^)\G\.)\K\w+', s)` and we need to `import regex` beforehand. – anubhava Feb 28 '23 at 15:02
  • But that puts all results in the same list. Although the Asker left this unspecified, it would seem reasonable that they want to combine those matches that belong to the same input line. How would that be done? – trincot Feb 28 '23 at 15:05
  • From the question it appears OP is running it one line at a time. I note this in question `I wish to capture the package name in segments so that use ieee.std_logic_1164.all; returns [IEEE, std_logic_1164, all]`. If OP clarifies this further I will post a working code using this approach. – anubhava Feb 28 '23 at 15:13
  • I agree, this needs clarification first. – trincot Feb 28 '23 at 15:18