I want to remove all the tags from an HTML file, and I am using Python's re module to do it.
For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). The result is an empty string, because the regexp matches from the first opening angle bracket to the last closing one and removes everything in between. How can I get around this?

5 Answers
Parse the HTML using BeautifulSoup, then only retrieve the text.
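
A minimal sketch of the "parse, then keep only the text" idea. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup (a third-party package that has to be installed separately), so treat it as an illustration of the approach, not of BeautifulSoup's API. The names TextExtractor and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for the text found between tags
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<h1>Hello World!</h1>"))  # Hello World!
```

Note that this sketch also keeps the contents of script and style elements, since the parser reports them as data too; a real page would probably need those filtered out.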

Is BeautifulSoup a Python module, or what is it? – PaulDaviesC Oct 08 '11 at 03:50
You can make the match non-greedy: '<.*?>'
You also need to be careful: HTML is a crafty beast and can thwart your regexes.
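
To make the difference concrete, here is a small sketch that runs the greedy and the non-greedy pattern on the line from the question. The variable names are only for illustration; the third pattern, <[^>]*>, is a common alternative that is not from the answer above.

```python
import re

line = "<h1>Hello World!</h1>"

# Greedy: .* runs to the LAST '>' on the line, so everything is removed
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: .*? stops at the FIRST '>' after each '<'
lazy = re.sub(r"<.*?>", "", line)

# Alternative: match anything except '>' between the brackets
negated = re.sub(r"<[^>]*>", "", line)

print(repr(greedy))   # ''
print(repr(lazy))     # 'Hello World!'
print(repr(negated))  # 'Hello World!'
```

One practical difference: without re.DOTALL the dot does not match a newline, so <.*?> misses a tag that spans several lines, while <[^>]*> still matches it.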

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
Off-topic: the regular-expression approach is error-prone, since it cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
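
A short sketch of the failure mode described here, using two made-up inputs where an angle bracket is not part of a tag. The function name strip_tags_regex is hypothetical.

```python
import re


def strip_tags_regex(text):
    # Non-greedy tag removal; fine for simple markup, fragile otherwise
    return re.sub(r"<.*?>", "", text)


# '<' and '>' used as comparison operators inside the text:
# the span "< b and b >" is mistaken for a tag and deleted
print(repr(strip_tags_regex("<p>if a < b and b > c</p>")))  # 'if a  c'

# '>' inside an attribute value ends the "tag" too early
print(repr(strip_tags_regex('<a title="x > y">link</a>')))  # ' y">link'
```

In both cases the output silently loses or leaks content, which is the kind of problem a real HTML parser is meant to avoid.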

Use a parser, either lxml or BeautifulSoup:
import lxml.html
mystring = '<h1>Hello World!</h1>'  # the HTML to strip
print(lxml.html.fromstring(mystring).text_content())  # text only, tags dropped
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

Beautiful Soup is great for parsing HTML!
You might not need it now, but it's worth learning. It will help you in the future too.
