
I'm learning regular expressions in Python and have run into a problem. Assume I have a variable called s:

>>> print(repr(s))
'HTML elements include\n\n* headings\n* paragraphs\n* lists\n* links\n* and more\n\nTry it!!!'

I want to match the '* headings\n* paragraphs\n* lists\n* links\n* and more\n' part of s (each item starts with *, ends with \n, and is repeated as many times as possible), so my code is:

>>> print(re.findall(r'(\*.+?\n)+', s))
['* and more\n']

I don't understand why only the last item is returned. When I use re.sub() instead, the whole list is replaced:

>>> print(re.sub(r'(\*.+?\n)+', 'text', s))
HTML elements include

text
Try it!!!

This shows that re.sub() matches exactly the part I want, so I'm really confused about why re.findall() returns only the last item. Thanks for your time.

Khai Hoan Pham
  • `re.findall()`, when given a pattern that contains a single capturing group, *returns only that group for each match* - on the assumption that you wouldn't have defined a capturing group if you weren't particularly interested in its contents. And a capturing group with a repetition operator applied to it only captures the final repetition. Instead of `(`...`)`, try a non-capturing group: `(?:`...`)`. – jasonharper Sep 25 '20 at 02:41
  • @jasonharper Thanks very much. It helps. I didn't read the docs carefully. – Khai Hoan Pham Sep 25 '20 at 05:31
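
To illustrate the comment above, here is a minimal sketch that runs both group types against the string from the question. The variable names `capturing` and `non_capturing` are only for illustration and are not part of the original post.

```python
import re

s = ('HTML elements include\n\n* headings\n* paragraphs\n* lists\n'
     '* links\n* and more\n\nTry it!!!')

# Capturing group: findall returns group 1 for each match, and a repeated
# group keeps only its final repetition.
capturing = re.findall(r'(\*.+?\n)+', s)

# Non-capturing group: no groups are defined, so findall returns the whole match.
non_capturing = re.findall(r'(?:\*.+?\n)+', s)

print(capturing)      # ['* and more\n']
print(non_capturing)  # ['* headings\n* paragraphs\n* lists\n* links\n* and more\n']
```

Both patterns match the same span of text, which is why re.sub() replaces the whole list in either case. Only the value that re.findall() reports differs.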

1 Answer


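The following regex extracts the part you want. Note that it assumes s is written as a raw string, so each \n is a literal backslash followed by n rather than a real newline:

import re

# Raw strings: every \n below is two characters (backslash + n), not a newline.
desired_output = r'* headings\n* paragraphs\n* lists\n* links\n* and more\n'

s = r'HTML elements include\n\n* headings\n* paragraphs\n* lists\n* links\n* and more\n\nTry it!!!'

# Greedy .+ captures from the first '*' preceded by 'n' up to (not including)
# the last backslash in s.
pattern = re.compile(r'n(\*.+)\\')

match = pattern.search(s).group(1)
print(match)

assert match == desired_output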
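
The snippet above works on raw strings, where \n is a literal backslash and n. If s instead holds real newline characters, as the repr() output in the question suggests, a variant along the following lines may be closer to what was asked. This is only a sketch: the name `bullets`, the use of `re.MULTILINE`, and the `^` anchor are my own choices, not something taken from the question.

```python
import re

# Regular (non-raw) string: each \n is a real newline character.
s = 'HTML elements include\n\n* headings\n* paragraphs\n* lists\n* links\n* and more\n\nTry it!!!'

# With re.MULTILINE, ^ matches at the start of every line, so only lines that
# begin with '*' are consumed. The group is non-capturing, so group(0) is the
# whole run of consecutive bullet lines.
bullets = re.compile(r'(?:^\*.*\n)+', re.MULTILINE)

m = bullets.search(s)
block = m.group(0) if m else ''
print(repr(block))
# '* headings\n* paragraphs\n* lists\n* links\n* and more\n'
```

Using `search()` with `group(0)` avoids the re.findall() behaviour of returning group contents, so the pattern would give the same result even if the group were a capturing one.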
Jose Mir