Your lxml code is very close to working. The main problem is that the table tag is not the one with the class="last" attribute. Rather, it is a tr tag that has that attribute:
</tr><tr class="last"><td>TRADING HOURS</td>
Thus, //table[@class="last"] has no matches. There is also a minor syntax error: @id"tradingHours" should be @id="tradingHours".
You can also omit //table[@class="last"] entirely, since table[@id="tradingHours"] is specific enough.
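To see the difference between the two selectors, here is a minimal sketch run against a made-up, simplified table. The markup in SNIPPET is hypothetical (only the id and class attributes mirror the real page), and it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed. The same path expressions should behave the same way with lxml's doc.xpath, though ElementTree does not support text(), so the cell text is read from each element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the class="last" attribute sits on a
# <tr>, not on the <table>, just as in the snippet quoted above.
SNIPPET = """
<html><body>
<table id="tradingHours">
  <tr><td>CITY</td><td>HOURS</td></tr>
  <tr class="last"><td>TRADING HOURS</td><td>8:00 PM-2:15 PM</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SNIPPET)

# No <table> carries class="last", so this selector finds nothing.
tables_with_class = root.findall('.//table[@class="last"]')

# The attribute is on a <tr>, so this selector finds one row.
rows_with_class = root.findall('.//tr[@class="last"]')

# Selecting by id alone is specific enough to reach every cell.
cells = [td.text.strip()
         for td in root.findall('.//table[@id="tradingHours"]//td')]

print(len(tables_with_class), len(rows_with_class))
print(cells)
```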
The closest analog to your BeautifulSoup code would be:
import urllib2
import lxml.html as LH
url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
doc = LH.parse(urllib2.urlopen(url))
for td in doc.xpath('//table[@id="tradingHours"]//td/text()'):
    print(td.strip())
The grouper recipe, zip(*[iter(iterable)]*n), is often very useful when parsing tables. It collects the items in iterable into groups of n items. Note that it relies on a single shared iterator, which is why iter() is called first. We could use it here like this:
texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
for group in zip(*[texts]*5):
    row = [item.strip() for item in group]
    print('\n'.join(row))
    print('-'*80)
I'm not terribly good at explaining how the grouper recipe works, but I've made an attempt here.
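As a rough illustration of the idea, the sketch below wraps the recipe in a small helper. The function name grouper and the sample cells list are made up for this example. The key point is that the list [it] * n holds n references to the same iterator, so each tuple zip builds consumes n consecutive items. Any leftover items that do not fill a complete group are silently dropped.

```python
def grouper(iterable, n):
    """Collect items into tuples of length n, dropping an incomplete tail."""
    # A single iterator referenced n times: zip pulls from it n times
    # per tuple, so consecutive items end up in the same group.
    it = iter(iterable)
    return zip(*[it] * n)


# Hypothetical cell texts; 'extra' does not fill a full group of 3.
cells = ['NEW YORK', '8:00 PM', '20:00',
         'LONDON', '1:00 AM', '01:00',
         'extra']

groups = list(grouper(cells, 3))
for group in groups:
    print(group)
```

If the trailing partial group matters, itertools.zip_longest can be used in place of zip to pad it with a fill value.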
This page is using JavaScript to reformat the dates. To scrape the page after the JavaScript has altered the contents, you could use selenium:
import urllib2
import lxml.html as LH
import contextlib
import selenium.webdriver as webdriver
url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(url)
    content = driver.page_source
    doc = LH.fromstring(content)
    texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
    for group in zip(*[texts]*5):
        row = [item.strip() for item in group]
        print('\n'.join(row))
        print('-'*80)
yields
NEW YORK
8:00 PM-2:15 PM *
20:00-14:15
7:30 PM
19:30
--------------------------------------------------------------------------------
LONDON
1:00 AM-7:15 PM
01:00-19:15
12:30 AM
00:30
--------------------------------------------------------------------------------
SINGAPORE
8:00 AM-2:15 AM *
08:00-02:15
7:30 AM
07:30
--------------------------------------------------------------------------------
Note that in this particular case, if you did not want to use selenium, you could use dateutil to parse the times and pytz to convert them yourself:
import dateutil.parser as parser
import pytz
text = 'Tue Jul 30 20:00:00 EDT 2013'
date = parser.parse(text)
date = date.replace(tzinfo=None)
print(date.strftime('%I:%M %p'))
# 08:00 PM
ny = pytz.timezone('America/New_York')
london = pytz.timezone('Europe/London')
london_date = ny.localize(date).astimezone(london)
print(london_date.strftime('%I:%M %p'))
# 01:00 AM