import re

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall("<tag.*>(.*)</tag>", test))

It outputs:

['part2']

The text can contain any number of "parts". I want to return all of them, not only the last one. What's the best way to do that?

Patrick Artner
  • It looks like you're trying to parse HTML with regular expressions... https://stackoverflow.com/a/1732454/3001761 – jonrsharpe May 22 '19 at 14:57
  • One way I thought of doing this is making a copy of the string, then erasing all matches of and then of , but I believe there's a better way to do this – potatosalad May 22 '19 at 15:01
  • The reason you're catching just one of the parts, is because you're using `*`, which is greedy. If you instead change the first `.*` to `.*?`, then the `?` modifier will make it non-greedy, which could do what you're trying to accomplish. But as @jonrsharpe is pointing out, please don't use RegEx as a parsing-method for HTML. – Hampus Larsson May 22 '19 at 15:02

1 Answer


You could change each `.*` to `.*?` so that the quantifiers are non-greedy and match as little as possible. That makes your original example work:

import re

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall(r'<tag.*?>(.*?)</tag>', test))

Output:

['part1', 'part2']
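
To see why the greedy pattern returns only the last part, it can help to capture what the first `.*` actually consumed. The snippet below is just an illustrative sketch: the extra capture group around the first quantifier is added here for inspection and is not part of the original pattern.

```python
import re

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'

# Greedy: the first .* grabs as much as it can and backtracks only
# until the rest of the pattern fits, so it swallows the whole first element.
greedy = re.search(r'<tag(.*)>(.*)</tag>', test)
print(greedy.group(1))  # >part1</tag><tag can have random stuff here
print(greedy.group(2))  # part2

# Non-greedy: each .*? stops at the earliest position where the rest matches.
lazy = re.findall(r'<tag(.*?)>(.*?)</tag>', test)
print(lazy)  # [('', 'part1'), (' can have random stuff here', 'part2')]
```

With the greedy version there is only one match covering the entire string, which is why `findall` returns a single item.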

Though it would probably be best not to parse this with regex alone, and to use a proper HTML parser library instead.
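
As a rough sketch of the parser-based approach, here is one way it might look with the standard library's `html.parser` module. The class name `TagTextCollector` and its attributes are made up for this example, and it assumes the target tags are not nested inside each other.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every element with the given tag name.

    Hypothetical helper for illustration; assumes no nesting of the target tag.
    """

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.parts = []
        self._buffer = None  # None while outside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self._buffer = []  # start collecting text

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag_name and self._buffer is not None:
            self.parts.append(''.join(self._buffer))
            self._buffer = None


test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
collector = TagTextCollector('tag')
collector.feed(test)
collector.close()
print(collector.parts)  # ['part1', 'part2']
```

The parser treats the "random stuff" as attributes of the start tag, so no pattern for it is needed. For real-world HTML a third-party library may be more convenient, but this keeps the example dependency-free.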

ruohola