Suppose I have the following string:
"<p>Hello</p>NOT<p>World</p>"
and I want to extract the words Hello and World.
I wrote the following script for the job:
#!/usr/bin/env python
import re
string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)
print(match)
I'm not particularly interested in stripping <p> and </p>, so I never bothered doing it within the script.
The interpreter prints:
['<p>Hello</p>NOT<p>World</p>']
so it obviously sees the first <p> and the last </p> while disregarding the tags in between. Shouldn't findall()
return all three matching strings, though (the string it prints, and the two words)?
And if it shouldn't, how can I alter the code to do so?
PS: This is for a project, and I found an alternative way to do what I needed, so this is for educational reasons, I guess.
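
For reference, a minimal sketch of one possible change. It assumes the cause is that + is greedy (it consumes as much as it can, up to the last </p>) and that findall() only returns non-overlapping matches, so the whole string and its pieces would not be returned together. The names tagged and words are made up for illustration, and a real HTML parser is probably the safer choice for anything beyond a toy string like this one.

```python
import re

string = "<p>Hello</p>NOT<p>World</p>"

# Non-greedy +? stops at the nearest </p> instead of the last one,
# so each <p>...</p> pair becomes its own match.
tagged = re.findall(r"(<p>[\w\W]+?</p>)", string)
print(tagged)  # ['<p>Hello</p>', '<p>World</p>']

# Moving the group inside the tags makes findall() return only the words.
words = re.findall(r"<p>([\w\W]+?)</p>", string)
print(words)  # ['Hello', 'World']
```

The only differences from the original pattern are the ? after + and, in the second call, the position of the parentheses.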