Sample text:

This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2

Intended result:

HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5

HeaderB1, HeaderB2

Regex Attempts:

(?:^This is (?P<H>HeaderB)\s) (Line (?P<L>\d)\s)*?

  • Matches only the header 'H' and the 1st 'L' line.

(?:^This is (?P<H>HeaderB)\s)? (Line (?P<L>\d)\s)*?

  • Manages to match multiple 'L' lines; however, only the first 2 lines belong to the same match, and the subsequent 'L' lines do not reference the Header capture group.

I made other attempts to adjust the regex but ended up breaking the expression. I have limited experience with regex, so I am not sure whether the desired output is possible.

Wai Ha Lee
Jmgbnt
    I don't think you can do that with a single regex. Which flags did you use for your attempts? Could you add your code? – cards Aug 13 '21 at 17:22

2 Answers


This mixes a regex substitution (used only to collect matches in order) with string formatting.

It assumes that each header is always followed by at least one `Line i` entry.

import re
text = """This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2"""

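ordered_matches = []  # one list per header: [header_suffix, number, number, ...]

def custom_match(m, all_matches=ordered_matches):
    # Called by re.sub for every match; used only for its side effect.
    p = m.group(0)
    if p.isdigit():
        all_matches[-1].append(p)  # a line number: attach it to the latest header
    else:
        all_matches.append([p])    # a header suffix such as 'A': start a new group
    return ''  # the substituted text is never used

# With re.M, '$' matches at the end of every line, so the pattern picks up
# the trailing 'A'/'B' of each header and the trailing digit of each line.
re.sub(r'([A-Z0-9]+)$', custom_match, text, flags=re.M)

for m in ordered_matches:
    # Repeat 'HeaderA{} ' once per line number, then fill in the numbers.
    print(('Header{}{{}} '.format(m[0]) * (len(m)-1)).format(*m[1:]))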

Output

HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5 
HeaderB1 HeaderB2 
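
As for why the single-regex attempts in the question only ever expose one line number: in Python's `re`, a capture group that repeats keeps only its last repetition. The sketch below first demonstrates that, then shows a two-pass alternative to the `re.sub` trick above: an outer pattern isolates each header block and `re.findall` extracts the numbers inside it. The patterns and the names `block_re`, `body` and `results` are illustrative assumptions based on the sample text, not something taken from the question.

```python
import re

text = """This is HeaderA
 Line 1
 Line 2
 Line 3
 Line 4
 Line 5
This is HeaderB
 Line 1
 Line 2"""

# A repeated group keeps only its final repetition, so a single match
# object cannot expose every line number.
single = re.search(r'This is (?P<H>Header\w+)\n(?: Line (?P<L>\d+)\n?)*', text)
print(single.group('H'), single.group('L'))  # HeaderA 5

# Two passes: the outer pattern isolates each header block,
# the inner findall pulls every line number out of that block.
block_re = re.compile(
    r'^This is (?P<H>Header\w+)\n(?P<body>(?:[ \t]+Line \d+\n?)*)', re.M
)

results = []
for block in block_re.finditer(text):
    numbers = re.findall(r'Line (\d+)', block.group('body'))
    results.append(' '.join(block.group('H') + n for n in numbers))

print('\n'.join(results))
```

This avoids the shared mutable list, at the cost of scanning each block a second time.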
Wai Ha Lee
cards

IIUC you're trying to combine the Header(A|B) with the integers in the following lines. With the given output, it's probably easier to work with simple split() operations instead of re.

for group in text.split('This is ')[1:]:
    header, *lines = group.splitlines()
    print(*[header+line.split()[-1] for line in lines])

Output:

HeaderA1 HeaderA2 HeaderA3 HeaderA4 HeaderA5
HeaderB1 HeaderB2
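
If you need the results as data rather than printed, the same split-based idea can be wrapped in a function. This is only a sketch: the name `combine`, the `marker` parameter and the `sample` string are made up for illustration. It also skips blank lines, since `line.split()[-1]` raises `IndexError` on an empty line (for example when the input ends with a newline or has a gap between blocks).

```python
def combine(text, marker='This is '):
    """Return {header: [header + number, ...]} for each header block."""
    combined = {}
    for group in text.split(marker)[1:]:
        header, *lines = group.splitlines()
        # Skip blank lines so trailing newlines or gaps do not raise IndexError.
        combined[header] = [
            header + line.split()[-1] for line in lines if line.strip()
        ]
    return combined


# Hypothetical input with a blank line between blocks and a trailing newline.
sample = "This is HeaderA\n Line 1\n Line 2\n\nThis is HeaderB\n Line 1\n"
print(combine(sample))
```

Like the loop above, this assumes the literal marker `This is ` never appears inside a line's own content.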
fsimonjetz