Python regex with multiple matches in the same string

Question

import re
test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall("<tag.*>(.*)</tag>", test))

It outputs:

['part2']

The text can have any number of "parts". I want to return all of them, not only the last one. What's the best way to do it?

It looks like you're trying to parse HTML with regular expressions... https://stackoverflow.com/a/1732454/3001761 — jonrsharpe, May 22 '19 at 14:57
One way I thought of doing this is making a copy of the string, then erasing all matches of the opening tag and then of the closing tag, but I believe there's a better way to do this. — potatosalad, May 22 '19 at 15:01
The reason you're catching just one of the parts is that `*` is greedy: each `.*` matches as much as it can, so the first one swallows everything up to the last opening tag. If you change both `.*` to `.*?`, the `?` modifier makes them non-greedy, which should do what you're trying to accomplish. But as @jonrsharpe points out, please don't use regex to parse HTML. — Hampus Larsson, May 22 '19 at 15:02

score 1 · Accepted Answer · answered May 22 '19 at 15:03

You could change both of your `.*` to `.*?` so that they are non-greedy and match as little as possible. That will make your original example work:

import re

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall(r'<tag.*?>(.*?)</tag>', test))

Output:

['part1', 'part2']
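To see why both quantifiers have to change, here is a small sketch that runs the same input through a few variants of the pattern. The variable names are only for illustration, and the last pattern (`[^>]*` for the tag's attributes) is an alternative I'm adding, not something from the question:

```python
import re

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'

# Both quantifiers greedy: the first .* runs to the last opening tag.
greedy = re.findall(r'<tag.*>(.*)</tag>', test)

# Only the first quantifier lazy: the capture group is still greedy,
# so it spans from the first part to the last closing tag.
half_lazy = re.findall(r'<tag.*?>(.*)</tag>', test)

# Both quantifiers lazy: each match stops at the nearest closing tag.
lazy = re.findall(r'<tag.*?>(.*?)</tag>', test)

# Alternative: a negated character class keeps the attribute part
# from running past the end of the opening tag.
negated = re.findall(r'<tag[^>]*>(.*?)</tag>', test)

print(greedy)     # ['part2']
print(half_lazy)  # ['part1</tag><tag can have random stuff here>part2']
print(lazy)       # ['part1', 'part2']
print(negated)    # ['part1', 'part2']
```

Note that `.` does not match newlines by default, so if a part can span several lines you would need to pass `re.DOTALL` as well.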

Though it would probably be best not to parse this with regex alone, and to use a proper HTML parser library instead.
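As a rough sketch of the parser route, assuming the input is HTML-like enough for the standard library's `html.parser` to handle, something along these lines should work. The `TagTextCollector` class and its attribute names are made up for this example:

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <tag_name> element."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.parts = []       # finished text of each element
        self._inside = False  # True while between the start and end tag
        self._buffer = []     # text chunks of the current element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so buffer it.
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag_name and self._inside:
            self.parts.append(''.join(self._buffer))
            self._inside = False


test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
collector = TagTextCollector('tag')
collector.feed(test)
collector.close()
print(collector.parts)  # ['part1', 'part2']
```

The parser treats "can have random stuff here" as valueless attributes, so the extra text in the opening tag needs no special handling. This sketch does not deal with nested tags of the same name.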
