Scraping dynamic updates of temperature sensor data from a website

Question

I wrote the following Python code:

from bs4 import BeautifulSoup
import urllib2

url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
freq=soup.find('div', attrs={'id':'frequenz'})
print freq

The result is:

<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>

When I look at this site in a web browser, the page shows dynamic content, not the string 'tempsensor'. The temperature value is refreshed automatically every second, so something in the web page is replacing the string 'tempsensor' with a numerical value.

My problem now is: how can I get Python to show the updated numerical value? How can I obtain the value that automatically replaces tempsensor when using BeautifulSoup?

"How can I evaluate the value of the variable tempsensor by python?" - I don't see any variable... — Nir Alfasi, Aug 15 '15 at 20:03
the actual url is http://www.netzfrequenz.info/charts/regelleistung — Chris Weber, Aug 15 '15 at 22:10
Ah, I see. The value you want is being updated by JavaScript. You can't just parse the HTML for it. You need to figure out the API. — Cyphase, Aug 15 '15 at 22:26
It looks like you just request `http://www.netzfrequenz.info/json/aktuell2.json?_=`. For example, http://www.netzfrequenz.info/json/aktuell2.json?_=1439677724960. — Cyphase, Aug 15 '15 at 22:28
@ChrisWeber, can you try to clarify what you want? It seems like no one is understanding. — Cyphase, Aug 16 '15 at 00:20
This was the right tip! How did you find out this function in the html-code? — Chris Weber, Aug 16 '15 at 08:11
Have you thought about the possibility, that the owner of the site does not like scraping his data with funny scripts? Why don't you ask him first? — mjay, Sep 01 '15 at 20:17

score 2 · Answer 1 · edited May 23 '17 at 10:26

Sorry, no. This is not possible with BeautifulSoup alone.

The problem is that BS4 is not a complete web browser. It is only an HTML parser. It parses neither CSS nor JavaScript.

A complete web browser does at least four things:

1. Connects to web servers and fetches data.
2. Parses HTML content and CSS formatting and presents a web page.
3. Parses JavaScript content and runs it.
4. Provides for user interaction: browser navigation, HTML forms, and an events API for the JavaScript program.

Still not sure? Look at your code. BS4 does not even include the first step, fetching the web page; to do that, you had to use urllib2.

Dynamic sites usually include JavaScript that runs in the browser and periodically updates the content. BS4 doesn't provide that, so you won't see the updates, and you never will by using only BS4. Why? Because item (3) above, downloading and executing the JavaScript program, is not happening. It would be happening in IE, Firefox, or Chrome, which is why those show dynamic content while BS4-only scraping does not.
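To make this concrete, here is a minimal sketch (Python 3, standard library only) of what a plain parser sees. It uses html.parser, the built-in module BS4 relies on when you pass "html.parser". The page content, the value '21.5', and the names DivTextExtractor and div_text are all made up for illustration; this is not the markup of the real site.

```python
from html.parser import HTMLParser

# Hypothetical page: in a browser, the inline script would overwrite the
# placeholder text. A parser only reads the script as inert text.
PAGE = '''<html><body>
<div id="frequenz">tempsensor</div>
<script>document.getElementById('frequenz').innerHTML = '21.5';</script>
</body></html>'''


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the <div> with the given id."""

    def __init__(self, target_id):
        HTMLParser.__init__(self)
        self.target_id = target_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.inside = False

    def handle_data(self, data):
        # Script contents also arrive here, but only as text and only
        # while self.inside is False, so they are ignored.
        if self.inside:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text of the div with id == target_id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()


print(div_text(PAGE, 'frequenz'))  # prints: tempsensor
```

The script is never executed, so the placeholder is all a parser can ever return, no matter how the div is looked up.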

PhantomJS and CasperJS provide a more mechanized browser that can often run the JavaScript code that drives dynamic websites. But CasperJS and PhantomJS are programmed in server-side JavaScript, not Python.

Apparently, some people use a browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending that to BS4 for parsing. That might allow for a Python solution.

In the comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it could be fetched and parsed with urllib2/BS4. You can determine this by carefully examining the JavaScript running on the site. In particular, look for setTimeout and setInterval, which schedule updates, or for ajax calls or jQuery's .load function, which fetch data from the back end. JavaScript that updates dynamic content will usually only fetch data from back-end URLs on the same website. If the site uses jQuery, $('#frequenz') refers to the div, and searching for this in the JS may lead you to the code that updates the div. Without jQuery, the JS update would probably use document.getElementById('frequenz').
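As a rough sketch of that approach, the code below requests the JSON URL quoted in the comments and decodes it. It is written for Python 3, where urllib.request replaces urllib2, and it uses the json module rather than BS4 because the endpoint appears to return JSON, not HTML. Several things here are assumptions: that the "_" parameter is a millisecond timestamp used as a cache buster, that the response is UTF-8 encoded, and the helper names build_url, parse_payload, and fetch_current, which are purely illustrative. The structure of the returned data is not assumed; inspect it before picking out a value.

```python
import json
import time
from urllib.request import urlopen

# URL taken from the comments on the question.
BASE_URL = 'http://www.netzfrequenz.info/json/aktuell2.json'


def build_url(base_url=BASE_URL, now=None):
    """Append a millisecond timestamp as the "_" query parameter.

    Assumption: "_" is only a cache buster, as in the example URL.
    """
    if now is None:
        now = time.time()
    return '%s?_=%d' % (base_url, int(round(now * 1000)))


def parse_payload(raw_bytes):
    """Decode a response body as JSON (assumes UTF-8 encoding)."""
    return json.loads(raw_bytes.decode('utf-8'))


def fetch_current(base_url=BASE_URL, timeout=10):
    """Fetch and decode the JSON document. Performs a real HTTP request."""
    response = urlopen(build_url(base_url), timeout=timeout)
    try:
        return parse_payload(response.read())
    finally:
        response.close()
```

Calling fetch_current() in a loop with time.sleep(1) between requests would mimic the page's once-per-second refresh, but check with the site owner before polling that often.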

score -2 · Answer 2 · answered Aug 15 '15 at 20:18

You're missing a tiny bit of code:

from bs4 import BeautifulSoup
import urllib2

url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
freq = soup.find('div', attrs={'id':'frequenz'})
print freq.string  # Added .string

answered Aug 15 '15 at 20:18

Cyphase

freq.string gives me only the string (name) of the variable and not the value. In this case it is a temperature value which is updated every second. tempsensor is only the variable's name. – Chris Weber Aug 15 '15 at 22:15
Ah, I thought your question was just oddly worded. But what value are you talking about? I don't see a value anywhere. – Cyphase Aug 15 '15 at 22:21
Ooh, do you mean you have a variable in your program called `tempsensor`? If so, can you put it in a `dict`? If so, you could do `print variable_dict[freq.string]`. Let me know if that works for you; I'll update the answer if it does. – Cyphase Aug 15 '15 at 22:21
Yes, the HTML code includes this variable and I want to get the value in Python. – Chris Weber Aug 15 '15 at 22:23
@ChrisWeber, where is the value in `tempsensor`? – Cyphase Aug 15 '15 at 22:24
"tempsensor" is the variable. When you use soup = BeautifulSoup(page.read(),"html.parser") you just get the name of the variable. – Chris Weber Aug 15 '15 at 22:29
@ChrisWeber, you're not being clear. Take a look at my comments to the question; let's try to clarify what you want. – Cyphase Aug 15 '15 at 22:30

Slavi · Answer 3 · 2015-08-15T22:48:02.990

-2

This should do it:

freq.text.strip()

As in

>>> from bs4 import BeautifulSoup
>>> html = '<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.text.strip()
u'tempsensor'

edited Aug 15 '15 at 22:48

answered Aug 15 '15 at 20:22

Slavi
