Parsing an HTML page with lxml in Python

Question

I want to evaluate this XPath query with lxml in Python:

.//*[@id='content_top']/article/div/table/tbody/tr[5]/td/p/text()

I checked the XPath query in FirePath (the Firebug extension for XPath) and it works, but my Python code shows me nothing. Here's the source.

from lxml import html
import requests

page = requests.get("http://www.scienzeetecnologie.uniparthenope.it/avvisi.html")
tree = html.fromstring(page.text)
avvisi = tree.xpath(".//*[@id='content_top']/article/div/table/tbody/tr[5]/td/p/text()")
print(avvisi)

The output is "[]".

score 1 · Accepted Answer · edited May 23 '17 at 12:21


There is no actual <tbody> element in the source HTML; it's just an element that the browser's HTML parser adds to the DOM.

Firebug displays the DOM, and I am guessing that FirePath, being a Firebug extension, works on this DOM rather than on the source HTML.

For a more detailed explanation of <tbody> and why Firebug displays it, check the answers to the SO questions "Why does firebug add <tbody> to <table>?" and "Why do browsers insert tbody element into table elements?".

In your case, removing <tbody> from the XPath should make it work. Example:

avvisi = tree.xpath(".//*[@id='content_top']/article/div/table/tr[5]/td/p/text()")
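
To see the difference without hitting the network, here is a minimal sketch that runs the same kind of query against a small inline document. The HTML string and its cell text are made up for illustration and are not the real page. The last query, which uses // to reach the rows, is an extra suggestion rather than part of the fix above; it should match whether or not a <tbody> is present.

```python
from lxml import html

# Hypothetical stand-in for the raw page source: a table with no <tbody>,
# which is what requests downloads (as opposed to the browser's DOM).
SOURCE = (
    "<html><body>"
    "<div id='content_top'><article><div>"
    "<table>"
    "<tr><td><p>first</p></td></tr>"
    "<tr><td><p>second</p></td></tr>"
    "</table>"
    "</div></article></div>"
    "</body></html>"
)

tree = html.fromstring(SOURCE)
base = ".//*[@id='content_top']/article/div/table"

# The path copied from the browser tools expects a <tbody> that lxml never adds.
with_tbody = tree.xpath(base + "/tbody/tr[2]/td/p/text()")

# Dropping <tbody> matches the structure lxml actually parsed.
without_tbody = tree.xpath(base + "/tr[2]/td/p/text()")

# Using // between table and tr tolerates both layouts.
either = tree.xpath(base + "//tr[2]/td/p/text()")

print(with_tbody)     # []
print(without_tbody)  # ['second']
print(either)         # ['second']
```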

edited May 23 '17 at 12:21 by Community
answered Aug 02 '15 at 14:14 by Anand S Kumar

Thanks man, you made my day! :) But why do I get these strange characters in the list output, like \xa0 or similar? Is there a way to avoid printing them? – cdm Aug 02 '15 at 14:22
Print each element of the list separately. When you print the list itself, you get the `repr()` output of the strings. – Anand S Kumar Aug 02 '15 at 14:23
Something like - `for i in avvisi: print(i)` – Anand S Kumar Aug 02 '15 at 14:29
Glad I could be helpful. – Anand S Kumar Aug 02 '15 at 14:32
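
A small sketch of what the comments above describe, using made-up sample strings instead of the real query result. \xa0 is a non-breaking space (&nbsp; in HTML); it shows up as an escape only in the repr() that print uses for list items. The replace() step at the end is an additional option, not something suggested in the comments, in case you want ordinary spaces in the data itself.

```python
# Hypothetical sample resembling the strings the XPath query returns.
avvisi = ["Avviso\xa0importante", "\xa0Orario lezioni"]

# Printing the list shows repr() of each string, so the escape is visible.
print(avvisi)

# Printing each string on its own renders the non-breaking space as a space.
for item in avvisi:
    print(item)

# Optional: swap non-breaking spaces for ordinary ones and trim the ends.
cleaned = [item.replace("\xa0", " ").strip() for item in avvisi]
print(cleaned)  # ['Avviso importante', 'Orario lezioni']
```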
