Please explain how the re module works in this case: re.sub() and re.findall() seem to give different matches

Question

I'm learning regular expressions in Python and have run into this problem. Assume I have a variable called s:

>>> print(repr(s))
'HTML elements include\n\n* headings\n* paragraphs\n* lists\n* links\n* and more\n\nTry it!!!'

I want to match the '* headings\n* paragraphs\n* lists\n* links\n* and more\n' part of s (each item starts with *, ends with \n, and the item repeats as many times as possible), so my code is:

>>> print(re.findall(r'(\*.+?\n)+', s))
['* and more\n']

I don't understand why only the last item is returned. But when I use re.sub() instead, the whole run of items is replaced:

>>> print(re.sub(r'(\*.+?\n)+', 'text', s))
HTML elements include

text
Try it!!!

This shows that re.sub() matches the whole span I want, so I'm really confused about why re.findall() returns something different. Thanks for your time.

`re.findall()`, when given a pattern that contains a single capturing group, *returns only that group for each match* - on the assumption that you wouldn't have defined a capturing group if you weren't particularly interested in its contents. And a capturing group with a repetition operator applied to it only captures the final repetition. Instead of `(`...`)`, try a non capturing group: `(?:`...`)`. — jasonharper, Sep 25 '20 at 02:41
@jasonharper Thanks very much. It helps. I didn't read the docs carefully. — Khai Hoan Pham, Sep 25 '20 at 05:31
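
The behaviour described in the comment above can be checked with a short sketch. It rebuilds s from the repr shown in the question; the variable names (capturing, non_capturing, whole_matches, replaced) are only illustrative and do not come from the original post.

```python
import re

# s rebuilt from the repr in the question (real newlines, not raw strings)
s = ('HTML elements include\n\n* headings\n* paragraphs\n* lists\n'
     '* links\n* and more\n\nTry it!!!')

# One capturing group: findall returns the group's contents, and a repeated
# group keeps only its final repetition.
capturing = re.findall(r'(\*.+?\n)+', s)

# Non-capturing group: findall returns the whole match instead.
non_capturing = re.findall(r'(?:\*.+?\n)+', s)

# finditer yields match objects, so group(0) is the whole match even when
# the pattern contains a capturing group.
whole_matches = [m.group(0) for m in re.finditer(r'(\*.+?\n)+', s)]

# sub always replaces the whole match (group 0), which is why it appeared
# to "match more" than findall.
replaced = re.sub(r'(\*.+?\n)+', 'text', s)

print(capturing)
print(non_capturing)
print(whole_matches)
print(replaced)
```

Both functions find the same match. They differ only in what they report: re.sub() works on the whole match, while re.findall() reports the capturing group when the pattern has exactly one.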

score 0 · Answer 1 · answered Sep 26 '20 at 23:23

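The following regex matches what you want to achieve. Note that s and desired_output are ordinary (non-raw) string literals, so \n is a real newline, as in the repr shown in the question:

import re

desired_output = '* headings\n* paragraphs\n* lists\n* links\n* and more\n'

s = 'HTML elements include\n\n* headings\n* paragraphs\n* lists\n* links\n* and more\n\nTry it!!!'

# Non-capturing group, so the whole run of "* ...\n" lines is the match
pattern = re.compile(r'(?:\*.+\n)+')

# group(0) is the entire match, not just the last repetition
match = pattern.search(s).group(0)
print(match)

assert match == desired_output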
