Read value from web page using python

Question

I am trying to read a value from an HTML page into a variable in a Python script. I have already figured out how to download the page to a local file using urllib, and I could extract the value with a bash script, but I would like to try it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the Wired: line.

Any suggestions?

Thanks

Martijn Pieters · Answer 1 · 2013-10-06T00:02:01.020

Start with not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))

The from_encoding argument tells BeautifulSoup which encoding the web server declared for the page; if the server did not specify one, BeautifulSoup will make an educated guess for you.
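
The code above is Python 2. In Python 3, urllib2 became urllib.request, and the headers object no longer has getparam(). As a minimal sketch, assuming Python 3, the declared charset can be read with get_content_charset() instead. The helper name declared_charset and the hand-built Message objects are illustrative stand-ins for response.headers, so the snippet runs without a network connection:

```python
from email.message import Message


def declared_charset(headers):
    """Return the charset named in the Content-Type header, or None."""
    # In Python 3, urlopen(...).headers is an email.message.Message subclass,
    # so get_content_charset() is available on it.
    return headers.get_content_charset()


# Stand-ins for response.headers (hypothetical header values).
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

no_charset = Message()
no_charset['Content-Type'] = 'text/html'

print(declared_charset(headers))
print(declared_charset(no_charset))
```

The result can be passed as from_encoding in the same way; when it is None, BeautifulSoup falls back to its own detection.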

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
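
Note that float() raises ValueError on readings such as "Error" or "15.8-", both of which appear in the sample page. If other lines are ever parsed the same way, a small guard helps. The following is only a sketch in Python 3; parse_reading is a hypothetical helper, not part of BeautifulSoup:

```python
def parse_reading(line, label):
    """Return the number after `label` in `line`, or None if absent or non-numeric."""
    before, sep, after = line.partition(label)
    if not sep:
        return None  # label not present in this line
    try:
        return float(after.strip())
    except ValueError:
        return None  # e.g. 'Error' or '15.8-'


# Lines as they would come out of stripped_strings for the sample page.
lines = ['Wired: 17.4', 'P20 Pinot Noir: 15.8-', 'Soil Temp : Error']
print(parse_reading(lines[0], 'Wired:'))
print(parse_reading(lines[1], 'P20 Pinot Noir:'))
print(parse_reading(lines[2], 'Soil Temp :'))
```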

@beroe: The function the OP used has the signature `urlretrieve(url, filename)`; `page.html` is the filename the page was stored at, not part of the URL. — Martijn Pieters, Oct 04 '13 at 07:23

Adelmar · Answer 2 · 2015-04-14T06:20:43.490

This is called web scraping, and there is a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd like to do it with urllib/urllib2, you can accomplish that using regular expressions:

http://docs.python.org/2/library/re.html

Using regex, you basically use the surrounding context of your desired value as the key, then strip the key away. So in this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " and the newline character.
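
A minimal sketch of that idea in Python 3, using a capture group so the "Wired:" key does not need to be stripped afterwards. The html string is copied from the question; the pattern assumes the value is a plain decimal number:

```python
import re

# Sample taken from the question's HTML snippet.
html = '''<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
</div>'''

# "Wired:" is the key; the capture group keeps only the number that follows it.
match = re.search(r'Wired:\s*([-+]?\d+(?:\.\d+)?)', html)
wired = float(match.group(1)) if match else None
print(wired)
```

If the key is missing or the value is not numeric (for example "Error"), match is None and wired stays None rather than raising.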

score 0 · Answer 3 · answered Oct 04 '13 at 07:05

You can run through the file line by line, using find or a regular expression to check for the value(s) you need, or you can consider using Scrapy to retrieve and parse the page.
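
A rough sketch of the line-by-line approach with str.find, in Python 3. The helper name find_value is hypothetical, and the code assumes each reading ends its line, as in the question's page:

```python
import io


def find_value(lines, key):
    """Return the text following `key` on the first line that contains it."""
    for line in lines:
        pos = line.find(key)  # -1 when the key is not on this line
        if pos != -1:
            return line[pos + len(key):].strip()
    return None


# io.StringIO stands in for open('page.htm'); a real file iterates the same way.
page = io.StringIO('<br/> Wired: 17.4\n<br/>P10 Chard: 16.7\n')
print(find_value(page, 'Wired:'))
```

The function returns a string; wrap the result in float() once you know it is numeric.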

Steve Barnes