1

I am trying to scrape a website. I have been able to get the contents of the website into a string/file.

Now, I would like to search for a specific line that has something like:

<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>

There is guaranteed to be only one Key 1: on the website, and I need to get Value 1. What is the best way to do this? If it's through a regular expression, can you help me with how it should look? I haven't used regex much.

Regards, AMM

Acorn
AMM

4 Answers

5

Rather than use a regex, I would start by letting BeautifulSoup parse the HTML.

Then, you can use the built-in find functions to search for the "abc" and "aom_pb" classes.

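# BeautifulSoup 3 module layout; with the newer bs4 package the import is
# "from bs4 import BeautifulSoup".
import BeautifulSoup

# downloaded_str holds the HTML that was already fetched.
soup = BeautifulSoup.BeautifulSoup(downloaded_str)
# find() returns the first matching tag (or None if nothing matches).
key = soup.find('span', {'class': 'abc'}).text
value = soup.find('span', {'class': 'aom_pb'}).text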

If the class attribute isn't unique, just loop over the <li> elements until you find the right one:

for li in soup.findAll('li'):
    # Match the <li> whose key span holds the text we are after.
    if li.find('span', attrs={'class': 'abc'}, text='Key 1:'):
        print(li.find('span', {'class': 'aom_pb'}).text)

The key point is to let a parser turn this into a tree navigation problem rather than an ill-defined text search problem.

BeautifulSoup is a single, pure-Python file that is easy to add to your setup, and it is a popular choice. More sophisticated alternatives include html5lib and lxml. The standard library includes HTMLParser, but it is somewhat simplistic and doesn't handle ill-formed HTML very well.

The regex approach is a bit fragile, but you could try something like this (depending on how the data is usually laid out):

>>> import re
>>> s = '''<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'''
>>> re.search(r'Key 1:.*?(Value .*?)<', s).group(1)
'Value 1'
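
The pattern above assumes the value literally starts with "Value". As a hedged variation for the case where only the key text is known, the sketch below captures whatever the next span contains. It is written for Python 3, and the helper name value_after_key is made up for this illustration. It remains a text search, so it stays fragile if the markup changes.

```python
import html
import re


def value_after_key(page, key):
    """Return the text of the first <span> after the span ending in `key`.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    pattern = re.compile(
        re.escape(key)          # the literal key text, e.g. "Key 1:"
        + r"\s*</span>"         # end of the span that holds the key
        + r".*?"                # whatever separates the spans (here &nbsp;)
        + r"<span\b[^>]*>"      # opening tag of the next span, any attributes
        + r"(.*?)"              # capture the value lazily
        + r"</span>",
        re.DOTALL,
    )
    match = pattern.search(page)
    if match is None:
        return None
    # Decode entities such as &amp; in the captured text.
    return html.unescape(match.group(1)).strip()


s = '<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'
print(value_after_key(s, "Key 1:"))  # Value 1
```

Escaping the key with re.escape keeps characters such as "." or "?" in a key from being read as regex syntax.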
Raymond Hettinger
  • What if 'abc' is not unique, or for that matter even 'aom_pb' is not unique? The one thing that's guaranteed to be unique in the HTML is the text of Key 1, and Value 1 is in the next span element. – AMM Nov 06 '11 at 01:10
  • By the way, is BeautifulSoup included in Python by default (2.7.2), or does it have to be downloaded separately? – AMM Nov 06 '11 at 01:30
  • BeautifulSoup is not part of the Python standard library. – Acorn Nov 06 '11 at 01:33
4

You should use a parser such as lxml to extract data from HTML. Using regular expressions for such a task is A Bad Idea™.

lxml allows you to use XPath expressions to select elements, and in this case, the relevant "key" span can be selected using the expression //span[@class='abc' and text()='Key 1:']. This expression searches the whole tree for span elements whose class is abc and whose text is exactly Key 1:.

You can then use .getnext() on the element to get the following element that contains the data you want.

Here's how one would do it in full:

import lxml.html as lh

html = """
<html>
<head>
    <title>Test</title>
</head>
<body>
<ul>
    <li><span class="abc">Key 3:</span>&nbsp;<span class="aom_pb">Mango</span></li>
    <li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Pineapple</span></li>
    <li><span class="abc">Key 2:</span>&nbsp;<span class="aom_pb">Apple</span></li>
    <li><span class="abc">Key 7:</span>&nbsp;<span class="aom_pb">Peach</span></li>
</ul>
</body>
</html>
"""

tree = lh.fromstring(html)

for key_span in tree.xpath("//span[@class='abc' and text()='Key 1:']"):
    # getnext() returns the next sibling element, i.e. the value span.
    print(key_span.getnext().text)

Result:

Pineapple

Acorn
2

You shouldn't use regular expressions to parse HTML. Python has an HTML parser module, aptly named HTMLParser: http://docs.python.org/library/htmlparser.html
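
HTMLParser is event-driven and has no search functions, so finding "the span after Key 1:" means tracking state yourself. The following is a minimal sketch, assuming Python 3 (where the module is html.parser) and assuming the spans are not nested. The class name KeyValueFinder and the helper find_value are invented for this example.

```python
from html.parser import HTMLParser


class KeyValueFinder(HTMLParser):
    """Record the text of the first <span> closed after the span equal to `key`."""

    def __init__(self, key):
        super().__init__(convert_charrefs=True)
        self.key = key
        self.value = None        # result, filled in once found
        self._in_span = False    # are we currently inside a <span>?
        self._buffer = []        # text pieces of the current <span>
        self._key_seen = False   # has the key span already been closed?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True
            self._buffer = []

    def handle_data(self, data):
        # Text outside spans (such as the &nbsp; separator) is ignored.
        if self._in_span:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._in_span:
            return
        self._in_span = False
        text = "".join(self._buffer).strip()
        if self._key_seen and self.value is None:
            self.value = text          # first span after the key span
        elif text == self.key:
            self._key_seen = True


def find_value(page, key):
    """Return the value paired with `key`, or None if the key is absent."""
    finder = KeyValueFinder(key)
    finder.feed(page)
    finder.close()
    return finder.value


sample = '<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>'
print(find_value(sample, "Key 1:"))  # Value 1
```

This matches on the key text alone, so it does not depend on the class names being unique.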

Dan
  • Thanks Dan. In HTMLParser, is there a way to search based on Key 1 and then go to the next element, which will have Value 1? – AMM Nov 06 '11 at 01:11
1

Another approach using BeautifulSoup: loop over the <li> elements, and check the <span>s inside them.

import BeautifulSoup

downloaded_str = '''
<li><span class="abc">Key 0:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
<li><span class="abc">Key 1:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
<li><span class="abc">Key 2:</span>&nbsp;<span class="aom_pb">Value 1</span></li>
'''

soup = BeautifulSoup.BeautifulSoup(downloaded_str)
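for li in soup.findAll('li'):
    # Look only at spans that are direct children of this <li>.
    span = li.find('span', {'class': 'abc'}, recursive=False)
    if span and span.text == 'Key 1:':
        print(li.find('span', {'class': 'aom_pb'}, recursive=False).text)
        break  # the key is unique, so stop at the first match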
Petr Viktorin