Grabbing non-HTML data from a website using python

Question

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html

I would really like a python 2.6 solution.

It was easy to get the page html using urllib, but it seems like this number is live and not in the html. I inspected the element in Chrome and it's some td class thing.

But I don't know how to get at this with python. I tried beautifulsoup (but after several attempts gave up getting a tar.gz to work on my windows x64 system), and then elementtree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?

Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.

score 2 · Answer 1 · answered Dec 19 '10 at 04:19

It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or a similar library won't be enough, since those libraries don't run the JavaScript. You'll need a library like PyQt to simulate a browser rendering the page and executing the JS that fills in the numbers, then scrape the resulting HTML.

See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/

David


Great, will take a look and report back. Thanks. – JackSprat Dec 19 '10 at 04:20
You could also do what Mark suggests below and just make a call to the URL to fetch the JSON directly. However, I know that when I write code for emitting JSON meant only for use on my site, I check to make sure that it's an AJAX call, so this may or may not work. And Mark's right - this is almost certainly not cool with the site owner. – David Dec 19 '10 at 21:31

score 1 · Answer 2 · answered Dec 19 '10 at 16:49

If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:

http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2

This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
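As a rough sketch of that approach, the snippet below rebuilds the request URL shown above and parses the reply. It uses the standard-library `json` module (available from Python 2.6, and derived from simplejson) rather than simplejson itself. The function names `build_url` and `fetch_quotes` are made up for illustration. The query-string layout is simply copied from the captured request and may not be what the server requires. The structure of the returned JSON is not shown in this thread, so the sketch returns the parsed object without picking out any fields.

```python
import json
import time

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.6 / 2.7

# Endpoint copied from the AJAX request captured above; it may change or
# be restricted by the site at any time.
BASE_URL = ("http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/"
            "Exchange/XCME/FOI/FUT/Product/ES")


def build_url(contract_codes, now_ms=None):
    """Rebuild the query string seen in the captured request (hypothetical helper)."""
    if now_ms is None:
        # The captured currentTime value looks like milliseconds since the epoch.
        now_ms = int(time.time() * 1000)
    # The captured URL has a comma before the first contract code; keep it.
    return "%s?currentTime=%d&contractCDs=,%s" % (
        BASE_URL, now_ms, ",".join(contract_codes))


def fetch_quotes(contract_codes):
    """Fetch the URL and return the decoded JSON as Python objects."""
    response = urlopen(build_url(contract_codes))
    try:
        # Assumes the body is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

Calling `fetch_quotes(["ESH1", "ESM1"])` and printing the result should show which keys hold the prices, assuming the server still answers a plain request of this form.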

Also, you should read the site's disclaimer very carefully. What you are trying to do is probably not cool with the owners of the website.

Mark


"However, deep linking to this domain is not allowed without CME Group's written consent." Stupid, isn't it. Better remove your link to /disclaimer.html ;-) – Chris Morgan Dec 19 '10 at 23:03
Hi, this will work with a standard urllib read, and then I can parse it using my basic py toolkit. Shame it's not allowed... Given that I'm (a) in a western country but not America, and (b) using it for private purposes only, what are the odds that a hypothetical website owner might get sniffy about this in a way that would actually bug me? – JackSprat Dec 20 '10 at 08:25

score 0 · Answer 3 · answered Dec 19 '10 at 04:05

It's hard to know what to tell you without knowing where the number is coming from. It could also be PHP or ASP, so you are going to have to figure out which language the number is in.

finfet


Exactly, that's my problem and why it's hard to formulate a better question: is it XML, PHP, ASP...I have no idea. It's right there if anyone wants to take a look... – JackSprat Dec 19 '10 at 04:08
It is most likely PHP, but they also have a lot of JavaScript on their page, so I would check that too. – finfet Dec 19 '10 at 04:20
