findall() behaviour (python 2.7)

Question

Suppose I have the following string:

"<p>Hello</p>NOT<p>World</p>"

and I want to extract the words Hello and World.

I created the following script for the job:

#!/usr/bin/env python

import re

string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)

print match

I'm not particularly interested in stripping the tags, so I never bothered doing it within the script.

The interpreter prints

['<p>Hello</p>NOT<p>World</p>']

So it evidently matches from the first <p> to the last </p> and disregards the tags in between. Shouldn't findall() return all three matching strings, though (the string it prints, plus the two words)?

And if it shouldn't, how can I alter the code to do so?

PS: This is for a project, and I found an alternative way to do what I needed, so this is for educational reasons, I guess.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — BrenBarn, Apr 16 '16 at 19:28

Suever · Accepted Answer · 2016-04-16T19:22:05.220

You get the entire contents in a single match because [\w\W]+ is greedy: it matches as many characters as it can (including all of your <p> and </p> tags) and only backtracks far enough for the final </p> to match. To prevent this, use the non-greedy version by appending a ?.

match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']
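On the question of why findall() does not also return the two inner strings: it returns non-overlapping matches, scanning left to right, so once the greedy match has consumed the whole string there is nothing left to match. A small sketch that prints the spans of each match makes this visible (the variable names greedy and lazy are just for illustration):

```python
import re

string = "<p>Hello</p>NOT<p>World</p>"

# Record (start, end) for every match so the consumed ranges are visible.
greedy = [(m.start(), m.end()) for m in re.finditer(r"<p>[\w\W]+</p>", string)]
lazy = [(m.start(), m.end()) for m in re.finditer(r"<p>[\w\W]+?</p>", string)]

print(greedy)  # [(0, 27)]            one match covering the whole string
print(lazy)    # [(0, 12), (15, 27)]  two separate, non-overlapping matches
```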

From the documentation:

*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
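The example from the documentation can be checked directly. This is only a quick sketch using re.search on the sample text quoted above:

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the end and backtracks to the last '>'.
greedy = re.search(r"<.*>", text).group()
# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", text).group()

print(greedy)  # <a> b <c>
print(lazy)    # <a>
```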

If you don't want the <p> and </p> tags in the result, you can use look-ahead and look-behind assertions so that they are not included in the match.

match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
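An alternative to the lookaround assertions, sketched here as one possible variation, is to put the capturing group around the content only. When a pattern has exactly one group, findall() returns the text of that group rather than the whole match. Note also that \w+ does not match whitespace, so the lookaround pattern above skips paragraphs containing spaces; the second input below is an assumed example, not taken from the question.

```python
import re

string = "<p>Hello</p>NOT<p>World</p>"

# Only the parenthesised part is returned, so the tags are dropped.
words = re.findall(r"<p>(.+?)</p>", string)
print(words)  # ['Hello', 'World']

# Assumed extra input: paragraph text that contains a space.
spaced = "<p>Hello there</p>"
with_group = re.findall(r"<p>(.+?)</p>", spaced)
with_lookaround = re.findall(r"(?<=<p>)\w+?(?=</p>)", spaced)
print(with_group)       # ['Hello there']
print(with_lookaround)  # []
```

By default . does not match newlines; passing re.DOTALL as the flags argument would let the group span multiple lines.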

As a side note though, if you are trying to parse HTML or XML with regular expressions, it is preferable to use a library such as BeautifulSoup which is intended for parsing HTML.
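For comparison, here is a rough sketch of the parser-based approach. To keep it free of third-party dependencies it uses the standard library's HTMLParser rather than BeautifulSoup, so it is a stand-in for the idea, not BeautifulSoup's API. The class name ParagraphCollector is hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <p> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text outside <p>...</p> (such as "NOT") is ignored.
        if self.in_p:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed("<p>Hello</p>NOT<p>World</p>")
collector.close()
print(collector.paragraphs)  # ['Hello', 'World']
```

This sketch does not handle nested <p> elements or malformed markup; a dedicated library such as BeautifulSoup is more forgiving there.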

I will look into BeautifulSoup as well, thanks for the suggestion. — persongr, Apr 16 '16 at 19:33
+1 for BeautifulSoup (or similar). HTML is not a regular language, so regular expressions are not a good tool for parsing them. It will be much easier to just use a library that understands HTML. — nighthawk454, Apr 16 '16 at 20:22
