I am trying to get started parsing HTML with lxml. I know from basic XPath that / should select the root node, //body should select the body element wherever it is in the DOM, and so on. However, I am getting an empty list for all of them.

from lxml import html
import urllib2
headers =  {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0'}
req = urllib2.Request("http://news.ycombinator.com", None, headers)
r = urllib2.urlopen(req).read()
x = html.fromstring(r)
x.xpath("/")
[]

EDIT:

For example, here is another valid XPath expression for that page which returns an empty list:

x.xpath("/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[1]/td[3]")
[] 
# when it should have returned the following (as of this time)
# <td class="title"><a href="http://www.tomdalling.com/blog/modern-opengl/opengl-in-2014/">OpenGL in 2014</a><span class="comhead"> (tomdalling.com) </span></td>
yayu
  • Don't you get **urllib2.HTTPError: HTTP Error 403: Forbidden**? – Nabin Sep 21 '14 at 10:29
  • And what does **[]** do? – Nabin Sep 21 '14 at 10:29
  • @Nabin Oh, in the actual code I am using a proxy and a fake user agent, which I didn't post. The `[]` is the output of the second-to-last line. I will make this code workable, just a min. – yayu Sep 21 '14 at 10:30
  • @Nabin I've changed the code, `r` contains the html of the homepage now. Tested it on my machine. – yayu Sep 21 '14 at 10:37
  • possible duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) – Jens Erat Oct 12 '14 at 17:37

1 Answer

Regarding your second question: the problem with the XPath expression is most likely the tbody element. Several existing Stack Overflow questions cover this, e.g. "Why do browsers insert tbody element into table elements?" and "Why does firebug add <tbody> to <table>?". The short version is that browsers add elements such as head and tbody to the DOM even when they are not in the source code, so an XPath copied from the browser won't match the document that lxml parses. You can just omit the tbody:

x.xpath("/html/body/center/table/tr[3]/td/table/tr[1]/td[3]")

which seems to work, as stated in "Extracting lxml xpath for html table".
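To see the effect without fetching the live page, here is a minimal sketch in Python 3. The `SOURCE` markup is a made-up stand-in for the real page, not its actual HTML; the point is only that the source contains no tbody and lxml does not add one while parsing.

```python
from lxml import html

# Hypothetical markup standing in for the real page.
# Note that the source contains no <tbody> element.
SOURCE = """
<html><body>
<table>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(SOURCE)

# Path as a browser inspector would report it, including the tbody
# that the browser inserted into its own DOM.
with_tbody = doc.xpath("/html/body/table/tbody/tr[2]/td[1]")

# Same path with the tbody step removed, matching the actual source.
without_tbody = doc.xpath("/html/body/table/tr[2]/td[1]")

print(with_tbody)
print([td.text for td in without_tbody])
```

The first query comes back empty because there is no tbody in the parsed tree, while the second one finds the cell.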

But I favor the approach given in the first answer to "Python lxml XPath problem": omit the unnecessary parts of the XPath and shorten the query to the element you're looking for. The skipped levels have to be bridged with // (descendant) steps, because a single / only matches direct children. So instead of

x.xpath("/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[1]/td[3]")

you should get the result (possibly along with other matches, since // searches the whole document) with something like

x.xpath("//tr[3]//tr[1]/td[3]")
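As a rough illustration of shortening the query, the sketch below (Python 3) uses made-up nested-table markup loosely modelled on the layout described in the question. The `SOURCE` string, the example.com link and the variable names are assumptions for demonstration only; only the `class="title"` attribute comes from the expected output quoted in the question.

```python
from lxml import html

# Hypothetical nested-table markup; not the real page source.
SOURCE = """
<html><body><center>
<table>
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td>
    <table>
      <tr>
        <td>1.</td>
        <td>vote</td>
        <td class="title"><a href="http://example.com/">Example story</a></td>
      </tr>
    </table>
  </td></tr>
</table>
</center></body></html>
"""

doc = html.fromstring(SOURCE)

# A single "/" only matches direct children, so skipping levels
# while still using "/" finds nothing.
child_only = doc.xpath("/html/tr[3]/tr[1]/td[3]")

# "//" matches descendants at any depth, so the intermediate
# levels can be left out.
descendant = doc.xpath("//tr[3]//tr[1]/td[3]")

# Selecting by attribute avoids depending on row and cell positions.
by_class = doc.xpath('//td[@class="title"]/a')

print(child_only)
print([td.text_content() for td in descendant])
print([a.get("href") for a in by_class])
```

Positional queries like `tr[3]` break as soon as the page layout shifts, so matching on an attribute such as the class is usually the more robust choice when one is available.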
matthias_h