
Suppose I have the following string:

"<p>Hello</p>NOT<p>World</p>"

and I want to extract the words Hello and World.

I created the following script for the job:

#!/usr/bin/env python

import re

string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)

print(match)

I'm not particularly interested in stripping <p> and </p>, so I never bothered doing it within the script.

The interpreter prints

['<p>Hello</p>NOT<p>World</p>']

so it obviously sees the first <p> and the last </p> while disregarding the tags in between. Shouldn't findall() return all three matching strings, though (the string it prints, plus the two words)?

And if it shouldn't, how can I alter the code to do so?

PS: This is for a project, and I found an alternative way to do what I needed, so this is for educational reasons, I guess.

persongr
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – BrenBarn Apr 16 '16 at 19:28

1 Answer

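You get the entire string as a single match because [\w\W]+ is greedy: it matches as many characters as it can, including the inner </p> and <p> tags, and only stops at the last </p>. Since findall() returns only non-overlapping matches, and that one match consumes the whole string, nothing is left over for further matches. To prevent this, use the non-greedy version of the quantifier by appending a ?.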

match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']

From the documentation:

*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
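
The documentation's example can be checked directly. The snippet below is a minimal sketch; the variable names (sample, text, greedy_match, lazy_match) are chosen here for illustration and are not part of the original script. It runs the two patterns from the quoted passage, then the greedy and non-greedy patterns against the string from the question.

```python
import re

# The example from the documentation passage quoted above.
sample = "<a> b <c>"
greedy_match = re.match(r"<.*>", sample).group()   # runs on to the last '>'
lazy_match = re.match(r"<.*?>", sample).group()    # stops at the first '>'
print(greedy_match)  # <a> b <c>
print(lazy_match)    # <a>

# The same difference on the string from the question.
text = "<p>Hello</p>NOT<p>World</p>"
greedy_all = re.findall(r"<p>[\w\W]+</p>", text)
lazy_all = re.findall(r"<p>[\w\W]+?</p>", text)
print(greedy_all)  # ['<p>Hello</p>NOT<p>World</p>']
print(lazy_all)    # ['<p>Hello</p>', '<p>World</p>']
```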

If you don't want the <p> and </p> tags in the result, you can use lookahead and lookbehind assertions, which check that the tags are present without including them in the match.

match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
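
An alternative to lookarounds, shown here as a sketch rather than as part of the original answer, is a single capturing group: when the pattern contains exactly one group, findall() returns only the text captured by that group. The example also uses a hypothetical input containing a space (the spaced variable) to show that \w cannot match it, whereas .+? can. Note that . does not match newlines unless re.DOTALL is passed.

```python
import re

text = "<p>Hello</p>NOT<p>World</p>"

# Lookaround version: the tags are asserted but not consumed.
lookaround = re.findall(r"(?<=<p>)\w+?(?=</p>)", text)

# Capturing-group version: findall() returns only the group's contents.
grouped = re.findall(r"<p>(.+?)</p>", text)

print(lookaround)  # ['Hello', 'World']
print(grouped)     # ['Hello', 'World']

# Hypothetical input with a space inside the paragraph.
spaced = "<p>Hello there</p>"
spaced_lookaround = re.findall(r"(?<=<p>)\w+?(?=</p>)", spaced)
spaced_grouped = re.findall(r"<p>(.+?)</p>", spaced)
print(spaced_lookaround)  # []
print(spaced_grouped)     # ['Hello there']
```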

As a side note, if you are trying to parse HTML or XML, it is preferable to use a library such as BeautifulSoup, which is designed for parsing HTML, rather than regular expressions.
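
To give a rough idea of the parser-based approach without requiring a third-party install, here is a minimal sketch that swaps in the standard library's html.parser module in place of BeautifulSoup. It assumes Python 3, and the ParagraphCollector class is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are between <p> and </p>
        self.paragraphs = []    # one string per <p> element seen

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text outside <p>...</p> (such as "NOT") is ignored.
        if self.in_p:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed("<p>Hello</p>NOT<p>World</p>")
collector.close()
print(collector.paragraphs)  # ['Hello', 'World']
```

With BeautifulSoup installed, the analogous code would be along the lines of [p.get_text() for p in BeautifulSoup(html, "html.parser").find_all("p")], where html is the input string.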

Suever
  • Thank you very much. I guess I overlooked that part of REs – persongr Apr 16 '16 at 19:21
  • I will look into BeautifulSoup as well, thanks for the suggestion. – persongr Apr 16 '16 at 19:33
  • +1 for BeautifulSoup (or similar). HTML is not a regular language, so regular expressions are not a good tool for parsing it. It will be much easier to just use a library that understands HTML. – nighthawk454 Apr 16 '16 at 20:22