The html content that I'm trying to scrape only appears to load when I navigate to a certain anchor within the site

Question

I'm trying to scrape a certain value off the following website: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data

Specifically, I'm trying to grab the "last" value from the table at the bottom of the page, the one with class "data default borderless". The issue is that when I search for that class, nothing is found.

The code I use is as follows:

from bs4 import BeautifulSoup
import urllib2
url = "https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
result = soup.findAll(attrs={"class":"data default borderless"})
print result

One issue I noticed is that when I pull the soup for that URL, it strips off the anchor tag and shows me the html for the url: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556

It was my understanding that anchor tags just navigate you around the page but all the HTML should be there regardless, so I'm wondering if this table somehow doesn't load unless you've navigated to the "data" section of the webpage.

Does anyone know how to force the table to load before I pull the soup? Is there something else I'm doing wrong that prevents me from seeing the table?

Thanks in advance!

score 2 · Accepted Answer · answered Nov 19 '13 at 03:28

The content is generated dynamically by the JavaScript below:

<script type="text/javascript">
        var app = {};
        app.isOption = false;
        app.urls = {
            'spec':'/productguide/ProductSpec.shtml?details=&specId=6747556',
            'data':'/productguide/ProductSpec.shtml?data=&specId=6747556',
            'confirm':'/reports/dealreports/getSampleConfirm.do?hubId=4080&productId=3418',
            'reports':'/productguide/ProductSpec.shtml?reports=&specId=6747556',
            'expiry':'/productguide/ProductSpec.shtml?expiryDates=&specId=6747556'
        };
        app.Router = Backbone.Router.extend({
            routes:{
                "spec":"spec",
                "data":"data",
                "confirm":"confirm",
                "reports":"reports",
                "expiry":"expiry"
            },
            initialize: function(){
                _.bindAll(this, "spec");
            },
            spec:function () {
                this.navigate("");
                this._loadPage('spec');
            },
            data:function () {
                this._loadPage('data');
            },
            confirm:function () {
                this._loadPage('confirm');
            },
            reports:function () {
                this._loadPage('reports');
            },
            expiry:function () {
                this._loadPage('expiry');
            },
            _loadPage:function (cssClass, cb) {
                $('#right').html('Loading..').load(this._makeUrlUnique(app.urls[cssClass]), cb);
                this._updateNav(cssClass);
            },
            _updateNav:function (cssClass) {
                // the left bar gets hidden on margin rates because the tables get smashed up too much
                // so ensure they're showing for the other links
                $('#left').show();
                $('#right').removeClass('wide');
                // update the subnav css so the arrow points to the right location
                $('#subnav ul li a.' + cssClass).siblings().removeClass('on').end().addClass('on');
            },
            _makeUrlUnique:function (urlString) {
                return urlString + '&_=' + new Date().getTime();
            }
        });

        // init and start the app
        $(function () {
            window.router = new app.Router();
            Backbone.history.start();
        });
    </script>

Two things you can do:

1. Figure out the real path and variables the page uses to pull the data. See the entry 'data':'/productguide/ProductSpec.shtml?data=&specId=6747556' in app.urls: the script requests that URL and loads the response into the page.
2. Use the RSS feed they provide and construct your own table.
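
To make option 1 concrete, here is a minimal Python 3 sketch that builds the same URL the script's _loadPage requests for 'data' (including the timestamp that _makeUrlUnique appends) and downloads the fragment. The helper names, the User-Agent value, and the timeout are my own assumptions, and I have not checked whether the server still answers this path the same way.

```python
import time
from urllib.parse import urljoin
from urllib.request import Request, urlopen

BASE_URL = "https://www.theice.com"
# Path copied from app.urls['data'] in the page script.
DATA_PATH = "/productguide/ProductSpec.shtml?data=&specId=6747556"


def make_url_unique(path, now_ms=None):
    """Mirror the script's _makeUrlUnique: absolute URL plus a timestamp parameter."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds, like Date().getTime()
    return urljoin(BASE_URL, path) + "&_=" + str(now_ms)


def fetch_fragment(path, timeout=10):
    """Download the HTML fragment that the browser would load into #right."""
    # The User-Agent value is an assumption; some servers reject the default one.
    request = Request(make_url_unique(path), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If the path still works, fetch_fragment(DATA_PATH) should return the markup that ends up inside #right, which can then be handed to BeautifulSoup exactly as in the question.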

Thank you for the suggestions. How complicated is the process of figuring out the real path and variables? Is this something I can reasonably figure out without a deep understanding of js? How would I implement it in a python program? — TBK, Nov 19 '13 at 17:22
Not complicated at all if you know how to use Firebug or Fiddler. Here you go, I just got the real URL for your specific page: https://www.theice.com/marketdata/DelayedMarkets.shtml?productId=3418&hubId=4080 — Godinall, Nov 19 '13 at 17:34
This page contains only the table you need, so it should be easy to fetch with your soup script. — Godinall, Nov 19 '13 at 17:35
This works perfectly, and I installed firebug to reproduce the process myself. Thanks! — TBK, Nov 19 '13 at 18:37
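
Once the HTML from that URL is in hand, pulling out one column is straightforward. Below is a rough sketch that uses only the standard library's HTMLParser in place of BeautifulSoup, so it runs without extra installs. The sample markup, the "Last" header text, and the class and function names are assumptions for illustration; the real table may be laid out differently.

```python
from html.parser import HTMLParser

# URL reported in the comment above; it is not requested in this sketch.
MARKET_URL = "https://www.theice.com/marketdata/DelayedMarkets.shtml?productId=3418&hubId=4080"


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def column_values(html, header):
    """Return the values under the column whose header matches (case-insensitive)."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    titles = [title.lower() for title in parser.rows[0]]
    index = titles.index(header.lower())  # raises ValueError if the column is missing
    return [row[index] for row in parser.rows[1:] if len(row) > index]


# Made-up markup standing in for the real response.
SAMPLE = """
<table class="data default borderless">
  <tr><th>Contract</th><th>Last</th><th>Volume</th></tr>
  <tr><td>Dec13</td><td>3.61</td><td>120</td></tr>
  <tr><td>Jan14</td><td>3.70</td><td>85</td></tr>
</table>
"""

print(column_values(SAMPLE, "last"))  # ['3.61', '3.70']
```

With BeautifulSoup installed, soup.find_all("tr") plus a loop over the cells would do the same job in fewer lines.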

score 1 · Answer 2 · edited May 23 '17 at 11:50

The table is generated by JavaScript, and you can't get it from the static HTML alone; something has to run the page's script.

You could use Selenium to load the page and evaluate the JavaScript and HTML. Selenium opens a visible browser window by default, but you can use PhantomJS to run the browser headless.

Either way, you will need to load the actual JS in a browser to get the HTML it generates.
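
A minimal sketch of the Selenium route, under a few assumptions: it swaps PhantomJS for headless Chrome (as far as I know, current Selenium releases no longer ship a PhantomJS driver), it assumes Chrome and a matching driver are available on the machine, and the function names and timeout are made up for illustration. Selenium is imported inside the function so the selector helper can be used on its own.

```python
def css_for_classes(tag, class_attr):
    """Turn a class attribute such as 'data default borderless' into a CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


def rendered_html(url, selector, timeout=15):
    """Load url in headless Chrome and return the page source once selector exists."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the script has inserted the table, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on a timeout


URL = "https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data"
SELECTOR = css_for_classes("table", "data default borderless")
```

rendered_html(URL, SELECTOR) should then return HTML that includes the table, which BeautifulSoup can parse as usual. Waiting for the selector matters here: page_source read right after get() may still show the 'Loading..' placeholder.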

Take a look at this answer also

Good Luck!

answered Nov 19 '13 at 03:22

Serial


Thanks. I'm trying Selenium today though having some trouble getting the chrome plugin to install correctly – TBK Nov 19 '13 at 17:24

score 0 · Answer 3 · answered Nov 19 '13 at 03:14

The HTML is generated using JavaScript, so BeautifulSoup won't be able to get the HTML for that table. In fact the whole <div id="right" class="main"> is loaded using JavaScript; judging by the page script, they're using Backbone.js and jQuery.

You can check this by printing the value of soup.get_text(). You can see that the table is not there in the source.
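
The same check can be made on the class name directly. Here is a small sketch using only the standard library; the two HTML strings are made-up stand-ins for what urlopen receives and what the browser ends up with after the script runs, and the class and function names are my own.

```python
from html.parser import HTMLParser


class TableClassFinder(HTMLParser):
    """Record the class attribute of every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_classes.append(dict(attrs).get("class") or "")


def has_table(html, class_attr):
    """True if html contains a <table> carrying every class listed in class_attr."""
    finder = TableClassFinder()
    finder.feed(html)
    wanted = set(class_attr.split())
    return any(wanted <= set(found.split()) for found in finder.table_classes)


# Made-up stand-ins: the static source versus the page after the script has run.
STATIC_SOURCE = '<div id="left"></div><div id="right" class="main"></div>'
AFTER_SCRIPT = ('<div id="right" class="main">'
                '<table class="data default borderless"><tr><td>1</td></tr></table></div>')

print(has_table(STATIC_SOURCE, "data default borderless"))  # False
print(has_table(AFTER_SCRIPT, "data default borderless"))   # True
```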

In that case, there is no way for you to access the data unless your code does exactly what the script does to get the data from the server.
