I am trying to read a value from an HTML page into a variable in a Python script. I have already found a way to download the page to a local file using urllib, and I could extract the value with a bash script, but I would like to do it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the "Wired:" line.

Any suggestions?

Thanks

user2845506

3 Answers

Start by not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))

The from_encoding argument tells BeautifulSoup which encoding the web server declared for the page; if the server did not specify one, BeautifulSoup makes an educated guess for you.
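On Python 3 the charset lookup can be sketched with the standard library alone. This is an illustration under stated assumptions, not part of the original answer: responses from `urllib.request.urlopen` carry their headers as an `email.message.Message` subclass, whose `get_content_charset()` plays the role that `getparam('charset')` plays above. The helper name `decode_body`, the UTF-8 fallback, and the trailing degree sign in the sample bytes are all invented here for the demonstration.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode response bytes using the charset the server declared, if any."""
    # get_content_charset() returns the lower-cased charset parameter of the
    # Content-Type header, or None when the header carries no charset.
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Simulated headers, so the sketch runs without a network connection.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# The 0xb0 byte (a degree sign in ISO-8859-1) is made up, to give the
# decoder something non-ASCII to work on.
body = decode_body(b"<br/> Wired: 17.4\xb0", headers)
print(headers.get_content_charset())
print(body)
```

With a real response, `headers` would be `response.headers` and `raw` would be `response.read()`.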

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
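
Other lines in the sample do not hold a clean number (`15.8-`, `Error`), so `float()` would raise `ValueError` if the same loop were pointed at them. Below is a minimal sketch of a more forgiving lookup, written for Python 3 and operating on plain strings like the ones `stripped_strings` is expected to yield for the snippet. The helper name `read_value` is hypothetical, and returning `None` for unparseable readings is an assumption about the behaviour wanted.

```python
def read_value(lines, label):
    """Return the float that follows `label:` in the first matching line, else None."""
    for line in lines:
        # Split on the first colon: "Wired: 17.4" -> ("Wired", ":", " 17.4")
        name, sep, rest = line.partition(":")
        if sep and name.strip() == label:
            try:
                return float(rest)  # float() ignores surrounding whitespace
            except ValueError:
                return None  # e.g. "15.8-" or "Error"
    return None


# Strings taken from the question's sample page.
lines = [
    "Plateau (19:01)",
    "Wired: 17.4",
    "P10 Chard: 16.7",
    "P1 P. Gris: 17.1",
    "P20 Pinot Noir: 15.8-",
    "Soil Temp : Error",
    "Rainfall: 0.2",
]

print(read_value(lines, "Wired"))           # prints 17.4
print(read_value(lines, "P20 Pinot Noir"))  # prints None
```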
Martijn Pieters
  • @beroe: The function the OP used has the signature `urlretrieve(url, filename)`; `page.html` is the filename the page was stored at, not part of the URL. – Martijn Pieters Oct 04 '13 at 07:23

This is called web scraping, and there's a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd rather stick with urllib/urllib2 and the standard library, you can extract the value with regular expressions:

http://docs.python.org/2/library/re.html

Using regex, you basically use the surrounding context of your desired value as the key, then strip the key away. So in this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " and the newline character.
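
As a rough sketch of that approach (Python 3 syntax, run against the question's sample HTML rather than a live page), a capture group can do the "strip the key away" step in one go. The pattern below is one possible choice, not the only one, and it assumes the value is a plain decimal number.

```python
import re

# Sample page content copied from the question.
html = """<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>"""

# "Wired:" is the surrounding context (the key); the parentheses capture
# only the number that follows it, so no separate stripping is needed.
match = re.search(r"Wired:\s*(-?\d+(?:\.\d+)?)", html)
wired = float(match.group(1)) if match else None
print(wired)  # prints 17.4
```

If the label is missing or is not followed by a number, `match` is `None` and `wired` stays `None` instead of raising.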

Adelmar

You can run through the file line by line, using find or a regular expression to check for the value(s) you need, or you can consider using Scrapy to retrieve and parse the page.
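
The line-by-line variant with `str.find` might look like the following sketch (Python 3). An `io.StringIO` stands in for the downloaded file so the example runs on its own; with the real file you would pass `open('page.htm')` instead. The function name `find_reading` is made up for illustration.

```python
import io


def find_reading(lines, key="Wired:"):
    """Scan lines for `key` and return the number that follows it, or None."""
    for line in lines:
        pos = line.find(key)  # -1 when the key is not on this line
        if pos != -1:
            # Keep only what comes after the key, minus whitespace/newline.
            return float(line[pos + len(key):].strip())
    return None


# Stand-in for the saved page; the content is copied from the question.
page = io.StringIO(
    '<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">\n'
    '<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>\n'
    '<br/> Wired: 17.4\n'
    '<br/>P10 Chard: 16.7\n'
    '</div>\n'
)

value = find_reading(page)
print(value)  # prints 17.4
```

This assumes the text after the key is a clean number; a line such as `P20 Pinot Noir: 15.8-` would make `float()` raise `ValueError`.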

Steve Barnes