3

I was trying to implement lxml/XPath code to parse HTML from this link: https://www.theice.com/productguide/ProductSpec.shtml?specId=251 Specifically, I was trying to parse the `<tr class="last">` table near the end of the page.

I wanted to obtain the text in that sub-table, for example: "New York" and the hours listed next to it (and do the same for London and Singapore).

I have the following code (which doesn't work properly):

doc = lxml.html.fromstring(page)
tds = doc.xpath('//table[@class="last"]//table[@id"tradingHours"]/tbody/tr/td/text()')

With BeautifulSoup:

table = soup.find('table', attrs={'id':'tradingHours'})
for td in table.findChildren('td'):
    print td.text

What is the best method to achieve this? I want to use lxml, not BeautifulSoup (just to see the difference).

James Hallen

3 Answers

5

Your lxml code is very close to working. The main problem is that the table tag is not the one with the class="last" attribute. Rather, it is a tr tag that has that attribute:

    </tr><tr class="last"><td>TRADING HOURS</td>&#13;

Thus,

//table[@class="last"]

has no matches. There is also a minor syntax error: @id"tradingHours" should be @id="tradingHours".

You can also omit //table[@class="last"] entirely since table[@id="tradingHours"] is specific enough.
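To see the two fixes in isolation without fetching the page, here is a minimal sketch against an inline HTML fragment (the fragment below is invented for illustration; only the `id` and `class` names match the page discussed above):

```python
import lxml.html

# A tiny stand-in for the real page: the class="last" attribute sits on
# a tr, not on the table, and the table carries id="tradingHours".
snippet = '''
<table id="tradingHours">
  <tbody>
    <tr class="last"><td>NEW YORK</td><td>20:00-14:15</td></tr>
  </tbody>
</table>
'''
doc = lxml.html.fromstring(snippet)

# Corrected XPath: @id="tradingHours" (note the equals sign).
print(doc.xpath('//table[@id="tradingHours"]//td/text()'))
# ['NEW YORK', '20:00-14:15']

# The original //table[@class="last"] finds nothing, because no table
# has that class here:
print(doc.xpath('//table[@class="last"]'))
# []
```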


The closest analog to your BeautifulSoup code would be:

import urllib2
import lxml.html as LH

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
doc = LH.parse(urllib2.urlopen(url))
for td in doc.xpath('//table[@id="tradingHours"]//td/text()'):
    print(td.strip())

The grouper recipe, zip(*[iterable]*n), is often very useful when parsing tables. It collects the items in iterable into groups of n items. We could use it here like this:

texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
for group in zip(*[texts]*5):
    row = [item.strip() for item in group]
    print('\n'.join(row))
    print('-'*80)

I'm not terribly good at explaining how the grouper recipe works, but I've made an attempt here.
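In short: `zip(*[it]*n)` passes the *same* iterator to `zip` n times, so each output tuple consumes n consecutive items. A minimal sketch with made-up cell values:

```python
# zip receives the same iterator three times, so every tuple it builds
# advances that one iterator three steps: items come out in groups of 3.
cells = ['NEW YORK', '20:00', '14:15', 'LONDON', '01:00', '19:15']
it = iter(cells)
groups = list(zip(*[it]*3))
print(groups)
# [('NEW YORK', '20:00', '14:15'), ('LONDON', '01:00', '19:15')]
```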


This page is using JavaScript to reformat the dates. To scrape the page after the JavaScript has altered the contents, you could use selenium:

import urllib2
import lxml.html as LH
import contextlib
import selenium.webdriver as webdriver

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(url)
    content = driver.page_source
    doc = LH.fromstring(content)
    texts = iter(doc.xpath('//table[@id="tradingHours"]//td/text()'))
    for group in zip(*[texts]*5):
        row = [item.strip() for item in group]
        print('\n'.join(row))
        print('-'*80)

yields

NEW YORK
8:00 PM-2:15 PM *
20:00-14:15
7:30 PM
19:30
--------------------------------------------------------------------------------
LONDON
1:00 AM-7:15 PM
01:00-19:15
12:30 AM
00:30
--------------------------------------------------------------------------------
SINGAPORE
8:00 AM-2:15 AM *
08:00-02:15
7:30 AM
07:30
--------------------------------------------------------------------------------

Note that in this particular case, if you did not want to use selenium, you could use pytz to parse and convert the times yourself:

import dateutil.parser as parser
import pytz

text = 'Tue Jul 30 20:00:00 EDT 2013'
date = parser.parse(text)
date = date.replace(tzinfo=None)
print(date.strftime('%I:%M %p'))
# 08:00 PM

ny = pytz.timezone('America/New_York')
london = pytz.timezone('Europe/London')
london_date = ny.localize(date).astimezone(london)
print(london_date.strftime('%I:%M %p'))
# 01:00 AM
unutbu
  • Thanks, I was looking for this sort of answer. Would it be possible to differentiate between the cities like "New York" and the times using `xpath`. For example, this current `for` loop is printing everything, but I want to bucket the results as they are in the site: a city with its timings. – James Hallen Jul 31 '13 at 02:36
  • Thanks for the above, but something's not right. The results I'm getting are like: `Tue Jul 30 20:00:00 EDT 2013-Tue Jul 30 14:15:00 EDT 2013 * Tue Jul 30 19:30:00 EDT 2013`. It's the same timing, repeated over 3 times. Plus it adds the extra date feature, which I'm not sure where it's coming from. It should strictly be the times as in the site. Do you know what's happening here? Even if I use `BS`, it's the same result... – James Hallen Jul 31 '13 at 11:32
  • 1
    The page is using JavaScript to alter the HTML. `urllib2.urlopen` is downloading the HTML without any JavaScript processing. The browser is showing you the result after JavaScript processing. To scrape the page after JavaScript processing, you could use selenium (see above). – unutbu Jul 31 '13 at 12:11
  • So it's the same problem as my previous question, thanks again. – James Hallen Jul 31 '13 at 21:13
  • If you don't mind, how can you tell which parts/scripts are being processed by JavaScript? – James Hallen Aug 01 '13 at 00:41
  • My (very limited) understanding is that you'd have to review all the code in the `<script>` tags. – unutbu Aug 01 '13 at 00:48
1

I find CSS selectors more adaptable to page changes than XPath:

import urllib
from lxml import html

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'

response = urllib.urlopen(url).read()

h = html.document_fromstring(response)
for tr in h.cssselect('#tradingHours tbody tr'):
    td = tr.cssselect('td')
    print td[0].text_content(), td[1].text_content()
Faisal
1

If the site is proper HTML, id attributes are unique, and you can find the table with doc.get_element_by_id('tradingHours').

import urllib
from lxml import html

url = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=251'

response = urllib.urlopen(url).read()

h = html.document_fromstring(response)

print "BY ID"
tradingHours = h.get_element_by_id('tradingHours')

for tr in tradingHours.xpath('tbody/tr'):
    tds = tr.xpath('td')
    print tds[0].text.strip()
    for td in tds[1:]:
        print ' ', td.text.strip()

Results in

BY ID
NEW YORK
  Tue Jul 30 20:00:00 EDT 2013-Tue Jul 30 14:15:00 EDT 2013 *
  Tue Jul 30 19:30:00 EDT 2013
LONDON
  Tue Jul 30 20:00:00 EDT 2013-Tue Jul 30 14:15:00 EDT 2013
  Tue Jul 30 19:30:00 EDT 2013
SINGAPORE
  Tue Jul 30 20:00:00 EDT 2013-Tue Jul 30 14:15:00 EDT 2013 *
  Tue Jul 30 19:30:00 EDT 2013
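As an aside on the `get_element_by_id` lookup used above, a minimal sketch against an invented fragment: it raises KeyError when no element carries the id, unless you pass a default.

```python
from lxml import html

# get_element_by_id does a document-wide lookup by id.
doc = html.document_fromstring('<div id="tradingHours">hours</div>')
el = doc.get_element_by_id('tradingHours')
print(el.text)
# hours

# With no match, a KeyError is raised -- unless a default is supplied.
print(doc.get_element_by_id('missing', 'not found'))
# not found
```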
tdelaney