Processing an HTML file using Python

Question

I want to remove all the tags in an HTML file, and I used Python's re module for that. For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags I used re.sub('<.*>','',string). The result is an empty string, because the regexp matches from the first angle bracket to the last one and removes everything in between. How can I get around this?

score 1 · Answer 1 · answered Oct 08 '11 at 03:36

Parse the HTML using BeautifulSoup, then retrieve only the text.
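A minimal sketch of that approach, assuming the third-party beautifulsoup4 package is installed (it is not part of the standard library) and using the example line from the question. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<h1>Hello World!</h1>"

# Parse with the standard library's parser backend, then keep only the text nodes.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print(text)
```

For the sample line this should print just the heading text, with the tags dropped.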

answered Oct 08 '11 at 03:36

Sunjay Varma

Is BeautifulSoup a Python module, or what is it? – PaulDaviesC Oct 08 '11 at 03:50

Ned Batchelder · Accepted Answer · 2011-10-08T03:45:06.917

You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a crafty beast and can thwart your regexes.
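A small sketch of the difference between the two patterns, using the example line from the question (the variable names are illustrative):

```python
import re

line = "<h1>Hello World!</h1>"

# Greedy: '.*' runs from the first '<' to the last '>', so the whole line is removed.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: '.*?' stops at the nearest '>', so each tag is removed on its own.
non_greedy = re.sub(r"<.*?>", "", line)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'
```

This only covers simple, well-behaved markup like the sample line.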

edited Oct 08 '11 at 03:45

answered Oct 08 '11 at 03:38

Ned Batchelder

score 1 · Answer 3 · answered Oct 08 '11 at 03:39

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
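To illustrate the failure mode, here is a rough sketch in which angle brackets appear in ordinary text. Note the substitution: it uses the standard library's html.parser as a stand-in for lxml so that it runs without extra packages. The sample string and the TextExtractor helper are made up for this example.

```python
import re
from html.parser import HTMLParser

html = "<p>if a < b and b > c then stop</p>"

# The non-greedy regex treats "< b and b >" as if it were a tag and deletes it.
regex_text = re.sub(r"<.*?>", "", html)


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects only the text nodes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


extractor = TextExtractor()
extractor.feed(html)
extractor.close()
parser_text = "".join(extractor.parts)

print(repr(regex_text))
print(repr(parser_text))
```

The regex result loses the comparison in the middle of the sentence, while the parser-based result keeps it.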

score 1 · Answer 4 · edited May 23 '17 at 12:27

Use a parser, either lxml or BeautifulSoup:

import lxml.html

# mystring is assumed to hold the HTML source as a string
print(lxml.html.fromstring(mystring).text_content())

Related questions:

Using regular expressions to parse HTML: why not?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

edited May 23 '17 at 12:27

Community

answered Oct 08 '11 at 03:55

Marco Mariani

score 0 · Answer 5 · answered Oct 08 '11 at 06:22

Beautiful Soup is great for parsing HTML!

You might not need it now, but it is worth learning, and it will help you in the future too.

answered Oct 08 '11 at 06:22

varunl
