Xpath not working properly

Question

I am working with lxml and requests to scrape data for a language-learning program for some friends of mine who want to learn English. I am currently working on the slang-learning part of the program, so I'll skip straight to the main problem.

Here is a sample page that I am using to demonstrate my problem.

import requests
from lxml import html
def make_tree(url):
    headers = {'User-Agent':'Mozilla/5.0'}
    page = requests.post(url,headers=headers)
    return html.fromstring(page.text)

url = 'http://www.englishdaily626.com/slang.php?054'

t = make_tree(url)
print t.xpath('/html/body/p/table/tbody/tr/td/table[4]/tbody/tr[3]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p/span/text()')

This just gives me an empty list. My XPath is correct when checked in the XPath viewer in Firefox. What is the problem, then? It occurs everywhere except for href.

This may be due to changes which are made by scripting elements in the page. I would suggest to do the following: download the 'raw' page using wget. Then check if you can find your XPath expression in there. — Marcus Rickert, Nov 24 '13 at 12:57
Maybe it has something to do with namespaces; I'm trying to learn about them. — user3027126, Nov 24 '13 at 13:10
What do you mean by _it din't work_? You could find the XPath expression in the downloaded raw file? — Marcus Rickert, Nov 24 '13 at 13:29
I've set up a test environment on my machine. The problem already occurs at level `/html/body/p/table`, which is empty although `/html/body/p` returns a result set with three `<p>` nodes in it and the first `<p>` definitely has a `<table>` in it. — Marcus Rickert, Nov 24 '13 at 14:17
possible duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) — Jens Erat, Nov 24 '13 at 14:22
There is no `<tbody>` element in that page; Firebug adds it. See http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the/18241030#18241030. — Jens Erat, Nov 24 '13 at 14:26
I think there's a combination of problems. First, @JensErat is right. There are no `<tbody>` tags in the raw file. Second, the correction algorithm of `lxml` is apparently not able to handle the non-closed `<p>` tags at level 3. My browser says there are five `` tags and one `` tag below the `<body>` tag. `lxml` thinks there are three `` tags, two `` tags and one `` tag. I think it would be wiser to phrase the XPath more logically using attribute values than structurally. Could you tell us which excerpt exactly you are looking for? — Marcus Rickert, Nov 24 '13 at 18:21
@Markus after removing all the `/tbody` elements and the starting `/p` element, it works like a charm. Thanks for sticking with me, buddy! — user3027126, Nov 24 '13 at 20:03
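
Editor's note: the fix described in the last comment can be sketched as a small helper that drops every `tbody` step from an XPath copied out of a browser tool. This is only a sketch under assumptions: `strip_tbody` is a hypothetical name, the sample path is a shortened version of the one in the question, and the approach only makes sense when the raw HTML really contains no `<tbody>` tags. It does not handle the leading `/p` step, which the asker removed by hand.

```python
import re


def strip_tbody(xpath):
    """Remove every `tbody` location step (with an optional predicate).

    Hypothetical helper: browsers add <tbody> to the DOM they display,
    so a path copied from a browser tool may contain steps that do not
    exist in the raw HTML that lxml parses.
    """
    # Match "/tbody" or "/tbody[...]" only when it is a complete step,
    # i.e. followed by another "/" or by the end of the expression.
    return re.sub(r'/tbody(\[[^\]]*\])?(?=/|$)', '', xpath)


# Shortened, illustrative version of the path from the question.
browser_xpath = '/html/body/p/table/tbody/tr/td/table[4]/tbody/tr[3]/td[2]'
cleaned_xpath = strip_tbody(browser_xpath)
print(cleaned_xpath)
```

The cleaned string can then be passed to `t.xpath(...)` in place of the original. If a page does ship its own `<tbody>` tags, this helper would break a path that was correct, so it is worth checking the raw source first.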

score 0 · Answer 1 · answered Dec 05 '13 at 17:05

I'd recommend using a more flexible, general XPath query. If you're looking for the first definition, you could use this:

'//tr[td[1]/p/b/span = "Definition:"][1]/td[2]/p/span/text()'

This works in a browser and with lxml in your example script.
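
Editor's note: a minimal sketch of how a content-based query like the one above behaves, run against a small hand-written fragment rather than the real page. The `SAMPLE` markup and its text are assumptions made up for illustration; only the XPath expression comes from the answer. It requires `lxml` to be installed.

```python
from lxml import html

# Hypothetical fragment: a two-column "label / value" table.
# It is NOT copied from the page in the question.
SAMPLE = """
<html><body>
<table>
  <tr>
    <td><p><b><span>Definition:</span></b></p></td>
    <td><p><span>very easy</span></p></td>
  </tr>
  <tr>
    <td><p><b><span>Example:</span></b></p></td>
    <td><p><span>The test was a piece of cake.</span></p></td>
  </tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Select the row by the text of its label cell instead of by its
# position in the document, then read the text of the second cell.
query = '//tr[td[1]/p/b/span = "Definition:"][1]/td[2]/p/span/text()'
definitions = tree.xpath(query)
print(definitions)
```

Because the query anchors on the label text, it does not depend on whether a `<tbody>` is present or on how the parser repairs unclosed tags higher up in the tree.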

answered Dec 05 '13 at 17:05

brechin
