The information on this page is supplied by a JavaScript function. When you download the page with urllib, you get the HTML before the JavaScript is executed. When you view the page manually in a standard browser, you see the HTML after the JavaScript has been executed.
To get at the data programmatically, you need a tool that can execute JavaScript. There are a number of third-party options available for Python, such as selenium, WebKit, or spidermonkey.
Here is an example of how to scrape the page using selenium (with phantomjs) and lxml:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry'

# Let PhantomJS execute the JavaScript, then hand the rendered HTML to lxml.
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source

doc = LH.fromstring(content)
tds = doc.xpath(
    '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()')
# Group the flat list of cell texts into rows of 5 (see explanation below).
print('\n'.join(map(str, zip(*[iter(tds)]*5))))
yields
('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
('Sep13', '2/11/13', '9/27/13', '9/27/13', '9/27/13')
('Oct13', '2/11/13', '10/25/13', '10/25/13', '10/25/13')
...
('Aug18', '2/11/13', '8/31/18', '8/31/18', '8/31/18')
('Sep18', '2/11/13', '9/28/18', '9/28/18', '9/28/18')
('Oct18', '2/11/13', '10/26/18', '10/26/18', '10/26/18')
('Nov18', '2/11/13', '11/30/18', '11/30/18', '11/30/18')
('Dec18', '2/11/13', '12/28/18', '12/28/18', '12/28/18')
Explanation of the XPath:
lxml allows you to select tags using XPath. The XPath
'//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'
means
//table # search recursively for <table>
[@class="default"] # with an attribute class="default"
//tr # and find inside <table> all <tr> tags
[@class="odd" or @class="even"] # that have attribute class="odd" or class="even"
/td # find the <td> tags which are direct children of the <tr> tags
/text() # return the text inside the <td> tag
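To see what this expression selects without hitting the live page, here is a minimal, self-contained sketch; the HTML snippet is invented for illustration, only the XPath comes from the code above:
import lxml.html as LH

snippet = '''<table class="default">
  <tr class="odd"><td>Jul13</td><td>2/11/13</td></tr>
  <tr class="even"><td>Aug13</td><td>2/11/13</td></tr>
  <tr><td>ignored: this row has neither class="odd" nor class="even"</td></tr>
</table>'''

doc = LH.fromstring(snippet)
# Only the <td> text from rows whose class is "odd" or "even" is returned.
print(doc.xpath(
    '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'))
# ['Jul13', '2/11/13', 'Aug13', '2/11/13']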
Explanation of zip(*[iter(tds)]*5):
tds is a flat list. It looks something like
['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13', 'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13', ...]
Notice that each row of the table consists of 5 items, but our list is flat. So, to group every 5 items together into a tuple, we can use the grouper recipe: zip(*[iter(tds)]*5) is an application of that recipe. It takes a flat list, like tds, and turns it into a list of tuples with every 5 items grouped together.
Here is an explanation of how the grouper recipe works. Please read that, and if you have any questions about it, I'll be glad to try to answer.
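For example, here is a minimal sketch of the recipe applied to the first ten items of the list above:
# The same iterator is consumed five times per output tuple, so each tuple
# holds five consecutive items of the flat list.
flat = ['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13',
        'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13']
it = iter(flat)
print(list(zip(it, it, it, it, it)))   # same as list(zip(*[iter(flat)]*5))
# [('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13'),
#  ('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')]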
To get just the first column of the table, change the XPath to:
tds = doc.xpath(
    '''//table[@class="default"]
       //tr[@class="odd" or @class="even"]
       /td[1]/text()''')
print(tds)
For example,
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source

doc = LH.fromstring(content)
# td[1] restricts the selection to the first <td> in each row.
tds = doc.xpath(
    '''//table[@class="default"]
       //tr[@class="odd" or @class="even"]
       /td[1]/text()''')
print(tds)
yields
['Jul13', 'Aug13', 'Sep13', 'Oct13', 'Nov13', 'Dec13', 'Jan14', 'Feb14', 'Mar14', 'Apr14', 'May14', 'Jun14', 'Jul14', 'Aug14', 'Sep14', 'Oct14', 'Nov14', 'Dec14', 'Jan15', 'Feb15', 'Mar15', 'Apr15', 'May15', 'Jun15', 'Jul15', 'Aug15', 'Sep15', 'Oct15', 'Nov15', 'Dec15']