This code returns one element under Python 2.7.9 and no elements under 3.4.3. Why? How do I fix it for Python 3?

```python
import requests
from lxml import html

page = requests.get('http://www.bloomberg.com/markets/rates-bonds/government-bonds/us/').text
tree = html.fromstring(page)

line = tree.xpath('//table[@class="std_table_module dual_border_data_table clear"][2]')
print(line)
```
foosion
  • Are you using the same `lxml` version for both? – Jon Clements Feb 27 '15 at 13:31
  • @JonClements 3.4.2 for both – foosion Feb 27 '15 at 13:34
  • neither 2 nor 3 returns any data for me – Padraic Cunningham Feb 27 '15 at 13:34
  • @PadraicCunningham I get [] in 2 and [] in 3. I'm running under win7 64, if that matters. If you change the [2] to [1] do you get anything? – foosion Feb 27 '15 at 13:36
  • `lxml` version? `from lxml import etree etree.LXML_VERSION` – Padraic Cunningham Feb 27 '15 at 13:36
  • Might also want to check how often that page updates... Is it possible you've run it in 2.7 at one point, the page has changed, so when running again later, you don't get matches...? – Jon Clements Feb 27 '15 at 13:38
  • @Padraic he's already answered that ^^^^ :p – Jon Clements Feb 27 '15 at 13:38
  • @JonClements I've run it numerous times in rapid succession and always get the same result – foosion Feb 27 '15 at 13:39
  • @JonClements lol thought that was python version! – Padraic Cunningham Feb 27 '15 at 13:40
  • @foosion, I can find the table using beautifulSoup no problem – Padraic Cunningham Feb 27 '15 at 13:40
  • @PadraicCunningham The first part of the table (using [1] instead of [2]) works for me in both python 2 and 3. [2] is the problem under python 3. – foosion Feb 27 '15 at 13:42
  • what exactly are you trying to extract from the page? – Padraic Cunningham Feb 27 '15 at 13:46
  • I cannot even reproduce this in Python 3.4 with `lxml` 3.4.2, I get the table element no problem. Side note: *don't use `response.text`*. XML and HTML inform the parser what codec to use, so always use `response.content` instead. *In this case* it makes no difference to the outcome, however. – Martijn Pieters Feb 27 '15 at 13:48
  • @MartijnPieters Odd that you, padraic and I are getting different results. I use content, but thought trying text might help here. What OS are you using? – foosion Feb 27 '15 at 13:52
  • @foosion: I am using Mac OS X. I *have* seen issues reported with `lxml` on Windows before though; what version of `libxml2` is being used? `etree.LIBXML_VERSION` and `etree.LIBXSLT_VERSION` for me show 2.9.0 and 1.1.28, respectively. There is also `etree.LIBXML_COMPILED_VERSION` and `etree.LIBXSLT_COMPILED_VERSION`. – Martijn Pieters Feb 27 '15 at 13:58
  • @MartijnPieters 2.9.0 and 1.1.28 under Python 2 and 2.9.2 and 1.1.28 under Python 3 (compiled versions are the same). Perhaps there's an issue with 2.9.2 under win 7 64? – foosion Feb 27 '15 at 14:03
  • @foosion: sounds like it. – Martijn Pieters Feb 27 '15 at 14:11
  • @MartijnPieters The problem is finding a pre-compiled win 64, python 64 version to test. I got mine from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and I haven't seen another source (there are other sources for a 32 bit version). – foosion Feb 27 '15 at 14:13
  • @foosion: Did you try the [`libxml2` binaries](http://xmlsoft.org/sources/win32/)? – Martijn Pieters Feb 27 '15 at 14:14
  • @MartijnPieters I haven't found a win 64 bit version (other than as part of the entire package at http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml). Your link is the 32 bit version. – foosion Feb 27 '15 at 14:15
  • A search for `libxml2 windows x64` turns up https://code.google.com/p/library-prebuilt-for-windows/downloads/detail?name=libxml2-2.9.0-mingw-x64.7z&can=2&q= and https://forge.imag.fr/frs/?group_id=184&release_id=320 – Martijn Pieters Feb 27 '15 at 14:17
  • Strongly suggest separating page retrieval part from XPath selection part, both for your testing sanity and for the long term value of this question. What is being returned to different clients on different platforms at different times is more likely to vary than XPath across Python versions. – kjhughes Feb 27 '15 at 14:22
  • @MartijnPieters Now I have to figure out how to install that. – foosion Feb 27 '15 at 14:25
  • @kjhughes: the responses are the same, the only difference so far located is the version of the `libxml2` library used. – Martijn Pieters Feb 27 '15 at 14:26
  • @MartijnPieters. I am using linux with lxml 3.4.2 and I can reproduce – Padraic Cunningham Feb 27 '15 at 14:32
  • @MartijnPieters: With all due respect, that the responses have been established as being the same is far from obvious, and having a minimal, static page that demonstrates the issue would be better for testing sanity and the quality of the question for future reference. Thanks. – kjhughes Feb 27 '15 at 14:33
  • @kjhughes fair point – foosion Feb 27 '15 at 14:34
  • @PadraicCunningham which libxml2 version do you have for python 2 and for 3? – foosion Feb 27 '15 at 14:35
  • @MartijnPieters I can't figure out how to install that version of libxml2 (at least not without compiling). I'll start a new question on that issue if I can't find another way soon. – foosion Feb 27 '15 at 14:36
  • @PadraicCunningham: what versions of the libraries? – Martijn Pieters Feb 27 '15 at 14:41
  • @MartijnPieters I downloaded a local copy. It works fine under both python 2 and 3. This seems to suggest the problem is with requests or with encoding? – foosion Feb 27 '15 at 15:00
  • @foosion: does your downloaded copy match `requests.content`? What are the differences (use `print('\n'.join(difflib.ndiff(source.splitlines(), target.splitlines())))` to look for differences, should work if you decode both with the same encoding). – Martijn Pieters Feb 27 '15 at 15:02
  • @MartijnPieters, `libxml2 2.9.1+dfsg1-3ubuntu4.4` and `lxml 3.4.2`, I get no output using python2 or 3 – Padraic Cunningham Feb 27 '15 at 15:12
  • @PadraicCunningham: So 2.9.0 (me on 2.7 and 3.4, the OP on 2.7) doesn't show the problem, while with 2.9.1 and 2.9.2 (you on 2.7 and 3.4, and the OP on 3.4) there *is* a problem. That's at least a correlation. – Martijn Pieters Feb 27 '15 at 15:14
  • @MartijnPieters The only diffs are some numeric values (time stamps and trading prices) and a couple of cases of ¾ vs. ½. The latter is more likely to be an issue. – foosion Feb 27 '15 at 15:16
  • @MartijnPieters BTW, I decode the requests line, and encode the print, with utf8. Also, I didn't download the entire page, just cut and pasted the key portion into my code. – foosion Feb 27 '15 at 15:19
  • I compared what requests downloaded to the actual page source with diffchecker and it is identical – Padraic Cunningham Feb 27 '15 at 15:31
  • ¾ and ½ are the 3/4 and 1/2 chars. That was not the issue. This brings us back to the version differences. – foosion Feb 27 '15 at 15:34
  • @MartijnPieters FWIW, request for help installing 2.9.0 at http://stackoverflow.com/questions/28768925/how-to-install-libxml2-2-9-0-for-lxml-for-python-3-4-3-on-win-7-64 – foosion Feb 27 '15 at 15:45
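
The `difflib.ndiff` comparison suggested in the comments can be wrapped so that only the changed lines are printed. This is a minimal sketch: the helper name `show_differences` and the two sample strings are hypothetical stand-ins for a saved copy of the page and the decoded response body, and both inputs are assumed to have been decoded with the same encoding.

```python
import difflib


def show_differences(source: str, target: str) -> list:
    """Return only the ndiff lines that mark a change between two texts."""
    diff = difflib.ndiff(source.splitlines(), target.splitlines())
    # ndiff prefixes unchanged lines with two spaces; keep '- ', '+ ' and '? ' lines.
    return [line for line in diff if not line.startswith("  ")]


# Hypothetical sample data, not taken from the real page.
saved_copy = "<td>2 ½</td>\n<td>99.50</td>"
downloaded = "<td>2 ¾</td>\n<td>99.50</td>"

print("\n".join(show_differences(saved_copy, downloaded)))
```

An empty result means the two texts are line-for-line identical, which would point away from the download step and back toward the parser.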

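Following the suggestion to separate page retrieval from XPath selection, the same expression can be run against a small static page while printing the library versions mentioned in the comments. This is only a sketch: `STATIC_PAGE` is a hypothetical two-table document, not the Bloomberg markup, and the helper names are made up for illustration. The version attributes are the ones named in the thread.

```python
from lxml import etree, html

# Hypothetical static page: two sibling tables sharing the class from the question.
STATIC_PAGE = b"""<html><body>
<table class="std_table_module dual_border_data_table clear"><tr><td>first</td></tr></table>
<table class="std_table_module dual_border_data_table clear"><tr><td>second</td></tr></table>
</body></html>"""

XPATH = '//table[@class="std_table_module dual_border_data_table clear"][2]'


def library_versions() -> dict:
    """Collect the lxml, libxml2 and libxslt versions for a bug report."""
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (runtime)": etree.LIBXSLT_VERSION,
        "libxslt (compiled)": etree.LIBXSLT_COMPILED_VERSION,
    }


def select_tables(page: bytes) -> list:
    """Parse bytes (so the parser picks the codec) and apply the XPath."""
    tree = html.fromstring(page)
    return tree.xpath(XPATH)


for name, version in library_versions().items():
    print(name, version)

matches = select_tables(STATIC_PAGE)
print(len(matches), [m.text_content().strip() for m in matches])
```

Note that `[2]` here selects a table that is the second matching child of its parent, so the result depends on how the parser nests the tables. Running the same script under both interpreters, first on a static page and then on a saved copy of the real one, keeps network variation out of the comparison.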
0 Answers