How do I parse a number out of a web page that is being generated by Javascript, not HTML?

Question

I've just ordered a ControlByWeb temperature sensor. It has a built in web server. I had planned to have my home automation system poll this device periodically to use its temperature readings for various HA tasks. I'm using Python (2.7.x on Mac Mojave). I know how to get the html code of the device's web server. And I was planning to figure out how to parse the code to extract the actual temperature numbers. Simple enough. But the numbers are not actually there in the HTML! The manufacturer provides a sample page of a sample device, and while the temperatures show up while viewing the page, the numbers aren't in the page source. They're being generated (and refreshed) in real time by the page's javascript! How do I extract the three temperature numbers?!? Here's the page: http://107.1.170.22:9036.

You can use a library like Selenium or Puppeteer, which will actively execute JavaScript exactly as an end user's browser would, but a better route might be to reverse-engineer the API calls the app itself is making and hook in that way. In this specific example, it looks like the page you've linked is calling http://107.1.170.22:9036/state.xml?time=1629304236823. Whenever you need an up-to-date reading, all you'd need to do is build the URL with the current timestamp (which appears to be only a cache-busting mechanism anyway), do a GET request to the resulting URL, then parse the resulting XML. – esqew, Aug 18 '21 at 16:29
Wow, thanks for the quick responses. Esqew, thanks for the hint. Frankly, I was expecting the device's documentation to provide exactly what you suggested, some sort of URL format that would return just the reading(s). But the documentation is sparse. I'll see what I come up with and post my attempt, as per martineau's guidance. – Mark G, Aug 18 '21 at 16:37
And to clarify a bit, I'm on this site enough to see how you all help others by commenting and/or "fixing" their code attempts. But before I can even try I was hoping to get pointed in the right direction. I know how to obtain the html source code of a page using python. What I was actually asking was how to obtain the "code" of a web page that a browser is actually displaying, not the page's source code, if that makes sense. — Mark G, Aug 18 '21 at 16:44
Ha, got ahead of myself. Not too impressive for my first Stack Overflow post, right? Esqew, if I use just "http://107.1.170.22:9036/state.xml" it returns just what I need. I can easily parse the temperature numbers right out of those results. Thank you so much! — Mark G, Aug 18 '21 at 16:50
Esqew, I've reread your answer several times. I see now the need for the time component: to be sure I get a "fresh" set of data and not something cached. I'll keep that in mind. I would think that if I use python to snag that XML, there would be no cache involved, but it'll be easy enough to generate a unique time stamp for each get, just to play it safe. Thanks again. — Mark G, Aug 18 '21 at 16:57
Not sure how to mark this question as answered, or how to credit the comment that answered it... — Mark G, Aug 18 '21 at 16:58
@MarkG You should know that appending the timestamp will (in most configurations) cache-bust through the entire network - there is the potential that between the client and the server there are devices that cache the content of the response for identical requests - while you can tell Python/your local network stack to not cache responses, it won't guarantee that any hop along the network won't decide to do it *for* you. — esqew, Aug 18 '21 at 16:59
Ah, excellent info. This is all happening on my LAN (the sample page/device from the manufacturer is not what I'll be polling eventually). But I've been tripped up too many times by MacOS, which is rife with all kinds of seemingly unnecessary caches, to ignore your advice. Plus there are local routers and switches involved. Time stamp it is, as there's essentially no downside to including it... Thx! — Mark G, Aug 18 '21 at 23:52

score 0 · Accepted Answer · answered Aug 18 '21 at 16:59

You could use a library like Selenium or Puppeteer which will actively execute JavaScript exactly as an end-user’s browser would. A better route might be to reverse-engineer the API calls the app itself is making and hook in that way.

In this specific example, it looks like the page you've linked is calling http://107.1.170.22:9036/state.xml?time=1629304236823. Whenever you need an up-to-date reading, all you'd need to do is build the URL with the current timestamp (which appears to be only a cache-busting mechanism anyway), do a GET request to the resulting URL, then parse the resulting XML to meet the requirements of your end product.
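A minimal sketch of the build-URL and GET steps, assuming Python 3 and only the standard library (on Python 2.7, `urllib2` would take the place of `urllib.request`). The names `build_state_url` and `fetch_state` are made up for illustration, the millisecond timestamp simply mirrors the `time=1629304236823` value seen in the request, and UTF-8 decoding of the response is an assumption.

```python
import time
import urllib.request


def build_state_url(base_url, now=None):
    """Return the state.xml URL with a millisecond timestamp appended.

    The timestamp only makes each URL unique (cache busting).
    """
    if now is None:
        now = time.time()
    return "{}/state.xml?time={}".format(base_url.rstrip("/"), int(now * 1000))


def fetch_state(base_url, timeout=5.0):
    """GET the device's state.xml and return the body as text."""
    url = build_state_url(base_url)
    # The with-block closes the HTTP response once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the device serves UTF-8 (or plain ASCII) XML.
        return response.read().decode("utf-8")
```

Calling `fetch_state("http://107.1.170.22:9036")` would return the raw XML text, provided the sample device is still reachable; parsing it is a separate step.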

Ugh. So simple! A python search on the XML gets me my three temps! Thank you!! My HA system monitors indoor and outdoor temperature to turn on/off fans inside and citrus tree heaters outside (among other things). I was relying on a local weather website for the outdoor temp, which often wasn't reliable. Now with my new sensor I can get minute by minute updates and improve how everything works. I'm going to get a lime off that ---- tree if it's the last thing I do!! — Mark G, Aug 18 '21 at 17:53

score 0 · Answer 2 · answered Aug 19 '21 at 16:57

This seems to be the ticket, unless someone has a more elegant/bullet-proof way...

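import re
import time

import requests

try:
    # Append the current time so every request URL is unique (cache busting).
    # str() works on both Python 2.7 and Python 3; unicode() is Python 2 only.
    vTimeInSeconds = str(int(time.time()))
    vUrl = 'http://107.1.170.22:9036/state.xml?time=' + vTimeInSeconds
    vResult = requests.get(vUrl)
    vState = vResult.text
    # Pull each reading out of its <sensorN> element.
    vTempOutside = re.search('<sensor1>(.+?)</sensor1>', vState).group(1)
    vTempInside = re.search('<sensor2>(.+?)</sensor2>', vState).group(1)
    vTempMaster = re.search('<sensor3>(.+?)</sensor3>', vState).group(1)
except Exception:
    # my error handler...
    raise  # placeholder so the block parses; swap in real handling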
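
For comparison, here is a sketch of the same extraction using the standard library's `xml.etree.ElementTree` rather than regular expressions. The `sensor1` to `sensor3` tag names come from the snippet above; the sample XML, its `datavalues` root element, and the helper name `parse_temperatures` are assumptions for illustration, since the device's real response is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the sensor tag names are taken from the thread.
SAMPLE_STATE_XML = """<datavalues>
  <sensor1>71.3</sensor1>
  <sensor2>68.9</sensor2>
  <sensor3>70.1</sensor3>
</datavalues>"""


def parse_temperatures(xml_text, tags=("sensor1", "sensor2", "sensor3")):
    """Return {tag: float} for each requested tag.

    Tags that are missing or hold non-numeric text map to None.
    """
    root = ET.fromstring(xml_text)
    readings = {}
    for tag in tags:
        # ".//tag" searches all descendants, so the root name does not matter.
        element = root.find(".//" + tag)
        try:
            readings[tag] = float(element.text)
        except (AttributeError, TypeError, ValueError):
            # AttributeError: tag missing; TypeError/ValueError: empty or bad text.
            readings[tag] = None
    return readings
```

Returning `None` for a missing or malformed reading, instead of raising, lets a polling loop skip one bad sample without stopping; whether that is the right trade-off depends on the automation logic consuming the values.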

answered Aug 19 '21 at 16:57

Mark G


I think you’re almost there, but it’s not a terribly great idea to parse XML with RegExp as there are many, *many* edge cases. Python includes [an XML parser built-in](https://docs.python.org/3/library/xml.etree.elementtree.html) for this purpose. – esqew Aug 19 '21 at 17:03
Thanks, I'll check it out. I would much prefer the proper tool for parsing XML, rather than just searching for text... – Mark G Aug 20 '21 at 19:34
I have a new wrinkle that I need help with. The device I'm polling with the requests.get command periodically stops working (requiring a power on/off reset). I suspected faulty hardware, but the manufacturer's tech support thinks it's because my script is opening a connection to get the data but not closing that connection, which eventually "clogs up" the device's server, as the looping get commands continue to open more and more connections without ever closing any. I had no idea I had to "close" the connection a python get command creates, and so I have no idea how to do that. Thoughts? – Mark G Aug 23 '21 at 20:51
I would highly doubt this - if `requests.get()` never "closed" the connection, it wouldn't return anything to your `vResult` variable and would block the execution of the rest of the script. It's not like you're doing anything their boilerplate HTML page isn't, and *that* doesn't "clog" the system. It's tough to say why *exactly* this would be occurring, but the "clogging" idea seems far-fetched to me. – esqew Aug 23 '21 at 21:01
I found this: https://stackoverflow.com/questions/10115126/python-requests-close-http-connection but there are a lot of options. Not sure which one is the one for my case. My script is set up such that I sometimes execute the get once, for a quick update, but the same script also has a loop that calls requests.get every ten minutes. It seems like I should close the connection after every get, and not try to maintain just one session, because I think I could end up opening more than one session. – Mark G Aug 23 '21 at 21:02
I see what you're saying, but I wonder if it's because their web page establishes just one connection, and the repeated calls for the state.xml data do not create subsequent connections. It "counts" as just one. But my script is like opening one browser tab every ten minutes, without closing any of them, until the device maxes out. That's how their tech support explained it (though they didn't actually analyze my code). – Mark G Aug 23 '21 at 21:08
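
Whether or not unclosed connections are really what hangs the device, making each poll close its connection explicitly is cheap. Below is a hedged sketch using only the standard library (Python 3) as a stand-in for `requests`; with `requests`, a rough equivalent is wrapping the call in a `with requests.Session() as session:` block. The name `poll_once` and its injectable `opener` parameter are illustrative assumptions, not part of any library.

```python
import urllib.request


def poll_once(url, opener=urllib.request.urlopen, timeout=5.0):
    """Fetch url once and return the body text.

    The response (and its underlying connection) is closed when the
    with-block exits, even if reading raises an exception.
    """
    # urllib already asks the server not to keep the connection alive;
    # the header is set here only to make that intent explicit.
    request = urllib.request.Request(url, headers={"Connection": "close"})
    with opener(request, timeout=timeout) as response:
        # Assumption: the device serves UTF-8 (or plain ASCII) XML.
        return response.read().decode("utf-8")
```

Because every call opens and closes its own connection, the same function can serve both the one-off "quick update" and the ten-minute loop without any shared session to keep track of. The `opener` parameter exists only so the function can be exercised without a real device.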

score -1 · Answer 3 · answered Aug 18 '21 at 17:32

The temperature value does exist in the HTML:

Screenshot

However, the page requests this value from the server after it loads. You can check this in the dev console: the value is empty for a short time after page load. You'll need to use Selenium to read the page: load it, then wait before reading the element. You could use Selenium's built-in waits, or simply call time.sleep(x) to pause x seconds before reading the HTML.

answered Aug 18 '21 at 17:32

Joshua Morris


Thanks Joshua! I was using Safari's Show Page Source, and also its Save As command to view the HTML, so I didn't see the numbers doing either. Which is why I panicked! Esqew showed me how to get at the temperature numbers in a handy XML format, which I've already figured out how to use to extract the three temperature numbers (using a simple python search on the XML). So this is now solved for my purposes. Gotta love Stack Overflow! – Mark G Aug 18 '21 at 17:44
