I am working with lxml and requests to scrape data for a language-learning program for some friends who want to learn English. I am currently working on the slang-learning part of the program, so I'll skip straight to the main problem.

Here is a sample page that I am using to demonstrate my problem.

import requests
from lxml import html
def make_tree(url):
    headers = {'User-Agent':'Mozilla/5.0'}
    page = requests.post(url,headers=headers)
    return html.fromstring(page.text)

url = 'http://www.englishdaily626.com/slang.php?054'

t = make_tree(url)
print(t.xpath('/html/body/p/table/tbody/tr/td/table[4]/tbody/tr[3]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p/span/text()'))

This just gives me an empty list. My XPath is correct when checked in XPath viewer in Firefox. What is the problem, then? It occurs for everything except `href`.

  • This may be due to changes which are made by scripting elements in the page. I would suggest to do the following: download the 'raw' page using wget. Then check if you can find your XPath expression in there. – Marcus Rickert Nov 24 '13 at 12:57
  • @marcus Thanks, dude, but I've tried that and it didn't work. – user3027126 Nov 24 '13 at 13:08
  • Maybe it has something to do with namespaces; I'm trying to learn about them. – user3027126 Nov 24 '13 at 13:10
  • What do you mean by _it didn't work_? Could you find the XPath expression in the downloaded raw file? – Marcus Rickert Nov 24 '13 at 13:29
  • I've set up a test environment on my machine. The problem already occurs at level `/html/body/p/table`, which is empty although `/html/body/p` returns a result set with three `<p>` nodes in it and the first `<p>` definitely has a `<table>` in it. – Marcus Rickert Nov 24 '13 at 14:17
  • Possible duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) – Jens Erat Nov 24 '13 at 14:22
  • There is no `<tbody>` element in that page; Firebug adds it. See http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the/18241030#18241030. – Jens Erat Nov 24 '13 at 14:26
  • I think there's a combination of problems. First, @JensErat is right: there are no `<tbody>` tags in the raw file. Second, the correction algorithm of `lxml` is apparently not able to handle the non-closed `<p>` tags at level 3. My browser says there are five `` tags and one `` tag below the `` tag. `lxml` thinks there are three `` tags, two `` tags and one `` tag. I think it would be wiser to phrase the XPath logically, using attribute values, rather than structurally. Could you tell us exactly which excerpt you are looking for? – Marcus Rickert Nov 24 '13 at 18:21
  • @Markus After removing all the `/tbody` steps and the starting `/p` step, it works like a charm. Thanks for sticking with me, buddy! – user3027126 Nov 24 '13 at 20:03
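
The `<tbody>` point from the comments can be reproduced offline. The snippet below is only a minimal sketch: `RAW` is a made-up stand-in for the raw page source (not the real page), written without `<tbody>`, as the comments say the raw file is. It compares a path copied from a browser inspector with the same path minus the `/tbody` step.

```python
from lxml import html

# Hypothetical stand-in for the raw source: a table written without <tbody>.
RAW = """
<html><body>
<table>
  <tr><td>Definition:</td><td>to relax</td></tr>
</table>
</body></html>
"""

tree = html.fromstring(RAW)

# Path as a browser inspector would show it, with the <tbody> the browser adds.
with_tbody = tree.xpath('/html/body/table/tbody/tr/td[2]/text()')

# Same path with the /tbody step removed, matching the raw source.
without_tbody = tree.xpath('/html/body/table/tr/td[2]/text()')

print(with_tbody)     # nothing matches: lxml does not add a <tbody>
print(without_tbody)  # the text of the second cell
```

The first query returns an empty list because the tree that lxml builds has no `<tbody>` element; the second one finds the cell.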

1 Answer

I'd recommend using a more flexible, general XPath query. If you're looking for the first definition, you could use this:

'//tr[td[1]/p/b/span = "Definition:"][1]/td[2]/p/span/text()'

This works in a browser and with lxml in your example script.
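
As a rough illustration of why a query anchored on content is less fragile, here is a small self-contained sketch. The `SAMPLE` markup is hypothetical: it only mimics the label/value row layout that the query above assumes, and is not copied from the real page.

```python
from lxml import html

# Hypothetical markup: each row has a bold label cell and a value cell.
SAMPLE = """
<html><body><table>
  <tr>
    <td><p><b><span>Definition:</span></b></p></td>
    <td><p><span>to relax and do nothing</span></p></td>
  </tr>
  <tr>
    <td><p><b><span>Example:</span></b></p></td>
    <td><p><span>We chilled out at home.</span></p></td>
  </tr>
</table></body></html>
"""

# The query from the answer: find the row by its label, then read the second cell.
QUERY = '//tr[td[1]/p/b/span = "Definition:"][1]/td[2]/p/span/text()'

tree = html.fromstring(SAMPLE)
definitions = tree.xpath(QUERY)
print(definitions)
```

Because the query starts with `//tr` and selects the row by its label text, it does not depend on how many wrapper tables sit above the row or on whether a `<tbody>` is present.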

brechin