I have something like this <td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>
and I need to get text using Python.
How should I do it? I'm quite new to such things.
Python has a built-in HTML parser module:
http://docs.python.org/2/library/htmlparser.html
But I'd recommend Beautiful Soup. (Don't let the prehistoric-looking homepage fool you; it's a very nice library.)
Alternatively, you could try lxml, which is also very nice.
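A solution using Python's XML parser (minidom), with foo holding the markup from the question
>>> foo = "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
>>> from xml.dom.minidom import parseString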
>>> parseString(foo).getElementsByTagName("td")[0].firstChild.nodeValue
u'text I need to get'
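On Python 3, the same idea can be sketched with xml.etree.ElementTree from the standard library. Like the minidom version, this assumes the snippet is well-formed XML, which real-world HTML often is not. The names foo and cell_text are only illustrative.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; `foo` is an illustrative variable name.
foo = "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"

# fromstring() returns the root element, which here is the <td> itself.
cell = ET.fromstring(foo)

# .text holds the text that directly follows the opening tag.
cell_text = cell.text
print(cell_text)
```

If the markup is not well-formed, fromstring() raises a ParseError, so an HTML-aware parser is the safer choice for scraped pages.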
A solution using BeautifulSoup
>>> import BeautifulSoup
>>> BeautifulSoup.BeautifulSoup(foo).getText()
u'text I need to get'
A solution using HTMLParser
>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_data(self, data):
...         print data
...
>>> MyHTMLParser().feed(foo)
text I need to get
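The parser above prints every piece of text it sees, whatever tag it sits in. A rough Python 3 sketch (where the module is html.parser) that keeps only the text inside td elements might look like this. TdTextParser, its attributes, and the sample row are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TdTextParser(HTMLParser):
    """Collect text found inside <td> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <td> elements we are currently inside
        self.chunks = []  # text pieces seen while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Ignore text outside <td>, such as the <th> header below.
        if self.depth:
            self.chunks.append(data)


# Illustrative input: the cell from the question next to a header cell.
foo = "<tr><th>header</th><td width='370'>text I need to get</td></tr>"

parser = TdTextParser()
parser.feed(foo)
parser.close()

td_text = "".join(parser.chunks)
print(td_text)
```

Collecting the pieces in a list instead of printing them makes the result usable by the rest of a script.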
A solution using Regex
>>> import re
>>> re.findall("<.*?>(.*)<.*?>",foo)[0]
'text I need to get'
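The pattern above matches any pair of tags and captures greedily, so it can grab too much once the string holds more than one element. A slightly stricter sketch, assuming the target is always a td element with no nested td inside it, is shown here in Python 3. TD_PATTERN and the extra second cell are illustrative only. Regexes remain fragile for HTML in general.

```python
import re

# Sample markup from the question; `foo` is an illustrative variable name.
foo = "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"

# Match only <td ...>...</td>; the lazy (.*?) stops at the first closing tag.
TD_PATTERN = re.compile(r"<td\b[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)

# A second cell is appended to show that each cell is captured separately.
cells = TD_PATTERN.findall(foo + "<td>second cell</td>")
print(cells)
```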
Try this:
>>> html='''<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>'''
>>> from BeautifulSoup import BeautifulSoup
>>> ''.join(BeautifulSoup(html).findAll(text=True))
u'text I need to get'
>>>
This solution uses BeautifulSoup.
If BeautifulSoup is not installed on your system, you can install it like this:
sudo pip install BeautifulSoup