Parse text in HTML document using Python

Question

I have something like this: <td width='370' style='border-left: 1px solid #fff;'>text I need to get</td> and I need to extract the text using Python.

How should I do it? I'm quite new to such things.

related. http://stackoverflow.com/questions/1838637/html-agility-pack-for-python — naveen, Dec 27 '12 at 15:14

score 2 · Answer 1 · answered Dec 27 '12 at 15:14

I personally love BeautifulSoup.

Jill-Jênn Vie

score 0 · Answer 2 · answered Dec 27 '12 at 15:17

Python has a built-in HTML parser module:

http://docs.python.org/2/library/htmlparser.html

But I'd recommend Beautiful Soup. (Don't let the prehistoric-looking homepage fool you; it's a very nice library.)

Alternatively you could try lxml which is also very nice.

Abhijit · Answer 3 · 2012-12-27T15:23:37.223

0

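A solution using Python's XML parser (xml.dom.minidom)

>>> foo = "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
>>> from xml.dom.minidom import parseString
>>> parseString(foo).getElementsByTagName("td")[0].firstChild.nodeValue
u'text I need to get'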
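
Note that [0] simply takes the first td in the input, and minidom only accepts well-formed XML. If there are several cells, one option is to filter on an attribute. The following is a minimal sketch, not a definitive implementation: the two-cell sample markup and the helper name cell_text_by_width are assumptions made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical sample: two cells, only one of which has width='370'.
html = (
    "<tr>"
    "<td width='100'>other cell</td>"
    "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
    "</tr>"
)

def cell_text_by_width(markup, width):
    """Return the text of the first <td> whose width attribute equals `width`, else None."""
    # parseString raises xml.parsers.expat.ExpatError if the markup is not well-formed XML.
    doc = parseString(markup)
    for td in doc.getElementsByTagName("td"):
        if td.getAttribute("width") == width and td.firstChild is not None:
            return td.firstChild.nodeValue
    return None

print(cell_text_by_width(html, "370"))
```

Real-world HTML is often not valid XML (unclosed tags, bare ampersands), so this approach is only reasonable when you control the markup.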

A solution using BeautifulSoup

>>> import BeautifulSoup
>>> BeautifulSoup.BeautifulSoup(foo).getText()
u'text I need to get'

A solution using HTMLParser

>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_data(self, data):
...         print data
...
>>> MyHTMLParser().feed(foo)
text I need to get

A solution using Regex

>>> import re
>>> re.findall("<.*?>(.*)<.*?>",foo)[0]
'text I need to get'
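
Be aware that the greedy (.*) in this pattern captures everything between the first and the last tag, so it over-matches as soon as the input contains more than one element. If you do go the regex route, anchoring on the opening tag of the cell you want is somewhat safer. A minimal sketch, assuming the cell can be recognised by starting with <td width='370' (the two-cell sample is made up for illustration):

```python
import re

# Hypothetical sample: two cells, only one of which has width='370'.
html = (
    "<td width='100'>other cell</td>"
    "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
)

# Anchor on the opening tag of the wanted cell; the lazy (.*?) stops at the first </td>.
pattern = re.compile(r"<td width='370'[^>]*>(.*?)</td>", re.DOTALL)

match = pattern.search(html)
text = match.group(1) if match else None
print(text)
```

This still breaks on different quoting, reordered attributes, or nested tags, which is why a real parser is usually the better choice for anything beyond a one-off.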

edited Dec 27 '12 at 15:23

answered Dec 27 '12 at 15:18

Thanks for the answer, but I don't need all the text, just what follows that specific piece of HTML. – Mike Dec 27 '12 at 18:17
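
To get only the text inside that specific cell, one possibility is to match on the tag's attributes while parsing. The sketch below uses the standard-library parser under its Python 3 name (html.parser; in Python 2 the module is HTMLParser). The class name TdTextExtractor, the two-cell sample, and the choice of identifying the cell by width='370' are assumptions for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; the Python 2 module is HTMLParser

class TdTextExtractor(HTMLParser):
    """Collect text found inside <td> tags whose attributes include `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # dict of attribute name -> required value
        self.inside = False    # True while we are inside a matching <td>
        self.chunks = []       # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            attrs = dict(attrs)
            self.inside = all(attrs.get(k) == v for k, v in self.wanted.items())

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

# Hypothetical sample: two cells, only one of which has width='370'.
html = (
    "<td width='100'>other cell</td>"
    "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
)

parser = TdTextExtractor({"width": "370"})
parser.feed(html)
parser.close()
text = "".join(parser.chunks)
print(text)
```

This simple flag does not handle a td nested inside another td (nested tables); that case would need a depth counter instead of a boolean.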

score 0 · Answer 4 · answered Dec 27 '12 at 15:49

Try this:

>>> html = '''<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>'''
>>> from BeautifulSoup import BeautifulSoup
>>> ''.join(BeautifulSoup(html).findAll(text=True))
u'text I need to get'

This solution uses BeautifulSoup.

If BeautifulSoup is not installed on your system, you can install it with: sudo pip install BeautifulSoup

Adem Öztaş

I only need the text after that specific HTML, not all the text. – Mike Dec 27 '12 at 18:20
