Extracting data from HTML

Question

I am trying to scrape a website. I have been able to get the contents on the website into a string/file.

Now, I would like to search for a specific line that has something like:

<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>

There is guaranteed to be only one Key 1: in the website and I need to get the Value 1. What is the best way to do this? If it's through a regular expression, can you help me with how it should look? I haven't used regex much.

Regards, AMM

http://www.crummy.com/software/BeautifulSoup/download/2.x/documentation.html — phooji, Nov 06 '11 at 01:05
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — joshayers, Nov 06 '11 at 01:58

Raymond Hettinger · Answer 1 · 2011-11-06T21:19:17.750

Rather than use a regex, I would start by letting BeautifulSoup parse the HTML.

Then, you can use the built-in find functions to search for the "abc" and "aom_pb" classes.

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(downloaded_str)
key = soup.find('span', {'class': 'abc'}).text
value = soup.find('span', {'class': 'aom_pb'}).text

If the class attribute isn't unique, just loop over the elements until you find the right one:

for li in soup.findAll('li'):
    if li.find('span', attrs={'class': 'abc'}, text='Key 1:'):
        print li.find('span', {'class': 'aom_pb'}).text

The key point is to let a parser turn this into a tree navigation problem rather than an ill-defined text search problem.

BeautifulSoup is a single, pure Python file that is easy to add to your setup. It is a popular choice. More sophisticated alternatives include html5lib and lxml. The standard library includes HTMLParser, but it is somewhat simplistic and doesn't handle ill-formed HTML very well.

The regex approach is a bit fragile, but you could try something like this (depending on how the data is usually laid out):

>>> import re
>>> s = '''<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'''
>>> re.search(r'Key 1:.*?(Value .*?)<', s).group(1)
'Value 1'
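If neither class name can be relied on and only the key text is unique, the pattern can be anchored on the key span itself rather than on the literal word "Value". The following is a minimal sketch for Python 3, not a robust parser: `value_after_key` is a hypothetical helper name, and it assumes the value sits in the first span after the key span and that spans are not nested.

```python
import re
from html import unescape


def value_after_key(html, key):
    """Return the text of the first <span> following the span whose text is `key`.

    Hypothetical helper: assumes flat, non-nested spans like the sample line.
    """
    pattern = re.compile(
        r"<span[^>]*>\s*" + re.escape(key) + r"\s*</span>"  # the key span
        r".*?"                                               # anything in between
        r"<span[^>]*>(.*?)</span>",                          # the next span
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    # Decode entities such as &amp; and trim surrounding whitespace.
    return unescape(match.group(1)).strip()


sample = '<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'
print(value_after_key(sample, "Key 1:"))  # prints: Value 1
```

Returning None when the key is missing avoids the AttributeError that calling .group(1) on a failed search would raise. The fragility noted above still applies: any change in the markup between the two spans can break the match.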

What if 'abc' is not unique? Or, for that matter, what if even 'aom_pb' is not unique? The one thing that's guaranteed to be unique in the HTML is the text of Key 1, and Value 1 is in the next span element. — AMM, Nov 06 '11 at 01:10
By the way, is BeautifulSoup included in Python by default (2.7.2), or does it have to be downloaded separately? — AMM, Nov 06 '11 at 01:30

Acorn · Answer 2 · 2011-11-06T19:20:38.357

You should use a parser such as lxml to extract data from HTML. Using regular expressions for such a task is A Bad Idea^tm.

lxml allows you to use XPath expressions to select elements. In this case, the relevant "key" span can be selected using the expression //span[@class='abc' and text()='Key 1:']. This expression searches the whole tree for span elements whose class is abc and whose text is exactly Key 1:.

You can then use .getnext() on the element to get the following element that contains the data you want.

Here's how one would do it in full:

import lxml.html as lh

html = """
<html>
<head>
    <title>Test</title>
</head>
<body>
<ul>
    <li><span class="abc">Key 3:</span>&nbsp;<span class="aom_pb">Mango</span></li>
    <li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Pineapple</span></li>
    <li><span class="abc">Key 2:</span>&nbsp;<span class="aom_pb">Apple</span></li>
    <li><span class="abc">Key 7:</span>&nbsp;<span class="aom_pb">Peach</span></li>
</ul>
</body>
</html>
"""

tree = lh.fromstring(html)

for key_span in tree.xpath("//span[@class='abc' and text()='Key 1:']"):
    print key_span.getnext().text

Result:

Pineapple

score 2 · Answer 3 · answered Nov 06 '11 at 01:03

You shouldn't use regular expressions to parse HTML. There's an HTML parser module for Python, aptly named HTMLParser. http://docs.python.org/library/htmlparser.html

Dan

Thanks Dan. In HTMLParser, is there a way to search based on Key 1 and then go to the next element, which will have Value 1? – AMM Nov 06 '11 at 01:11
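To address the comment above: HTMLParser has no search functions, so a lookup like this has to be written as event handlers that track state while the document streams past. The following is a minimal sketch, assuming Python 3, where the module is html.parser rather than HTMLParser. The names `KeyValueParser` and `find_value` are hypothetical, and the code assumes spans are not nested.

```python
from html.parser import HTMLParser


class KeyValueParser(HTMLParser):
    """Record the text of the first <span> that follows the span containing `key`."""

    def __init__(self, key):
        super().__init__(convert_charrefs=True)
        self.key = key
        self.in_span = False   # True while inside a <span>
        self.key_seen = False  # True once the key span has been closed
        self.buffer = []       # text fragments of the current span
        self.value = None      # the result, once found

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True
            self.buffer = []

    def handle_data(self, data):
        # Text outside spans (such as the &nbsp; separator) is ignored.
        if self.in_span:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self.in_span:
            return
        self.in_span = False
        text = "".join(self.buffer).strip()
        if self.value is None:
            if self.key_seen:
                self.value = text      # first span after the key
            elif text == self.key:
                self.key_seen = True   # the next span holds the value


def find_value(html, key):
    """Return the value paired with `key`, or None if the key is absent."""
    parser = KeyValueParser(key)
    parser.feed(html)
    parser.close()
    return parser.value


sample = '<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'
print(find_value(sample, "Key 1:"))  # prints: Value 1
```

This relies only on the key text and the order of the spans, not on the class names. It is noticeably more code than the BeautifulSoup or lxml versions, which is the trade-off for staying within the standard library.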

score 1 · Answer 4 · answered Nov 06 '11 at 01:20

Another approach using BeautifulSoup: loop over the <li> elements, and check the <span>s inside them.

import BeautifulSoup

downloaded_str='''
<li><span class="abc">Key 0:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
<li><span class="abc">Key 2:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
'''

soup = BeautifulSoup.BeautifulSoup(downloaded_str)
for li in soup.findAll('li'):
    span = li.find('span', {'class': 'abc'}, recursive=False)
    if span and span.text == 'Key 1:':
        # 'return' is only valid inside a function; print the value and stop.
        print li.find('span', {'class': 'aom_pb'}, recursive=False).text
        break
