1

I am trying to scrape data from this page using the lxml module in Python. I want to get the text of the first paragraph, but the following code returns an empty list:

from lxml import html
import requests

page = requests.get('http://www.thehindu.com/todays-paper/with-afspa-india-has-failed-statute-amnesty/article7376286.ece')
tree = html.fromstring(page.text)
data = tree.xpath('//*[@id="left-column"]/div[6]/p[1]/text()')
print data
  • Well, at least when I fetch the page, `[@id="left-column"]` is empty. – dhke Jul 09 '15 at 15:34
  • @dhke When I inspect the element on the page and copy the XPath for that paragraph, this is the path I get. Am I doing something wrong here? – Saharsh Agarwal Jul 09 '15 at 15:43
  • Actually, when I try with `//div[class='articleLead']` or `//xh:div[class='articleLead']` (with `namespaces={'xh': 'http://www.w3.org/1999/xhtml'}`), the result is still empty even though I can clearly see that element ... – dhke Jul 09 '15 at 15:55
  • Even if I replace that line with `data = tree.xpath('//*[@class="body"]/text()')`, I'm not getting any value. – Saharsh Agarwal Jul 09 '15 at 15:57
  • Try downloading the file separately (outside of the browser) and checking its contents, because the raw data does not match what you see in the browser. This seems to be either a (nasty) bug, or they have some kind of scraping protection in place that edits the DOM after page load. – dhke Jul 09 '15 at 16:03
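
The check suggested in the last comment can be sketched roughly as follows. This is only an illustration: `count_matches` and `fetch_raw` are hypothetical helper names, and the two HTML strings are made-up stand-ins for "what the server sends" and "what the browser shows after scripts have run", not the actual content of the page in the question.

```python
from lxml import html


def count_matches(raw_html, xpath):
    """Return how many nodes `xpath` selects in the given HTML text."""
    tree = html.fromstring(raw_html)
    return len(tree.xpath(xpath))


def fetch_raw(url):
    """Download the page as a script sees it (no JavaScript is executed)."""
    import requests  # imported here so count_matches works without requests
    return requests.get(url).text


# Made-up stand-ins: the served markup has an empty container,
# while the rendered DOM has been filled in afterwards.
served = '<html><body><div id="left-column"></div></body></html>'
rendered = ('<html><body><div id="left-column">'
            '<div><p>First paragraph.</p></div></div></body></html>')

query = '//*[@id="left-column"]//p'
print(count_matches(served, query))    # 0
print(count_matches(rendered, query))  # 1
```

For the real page you would pass `fetch_raw(url)` to `count_matches` instead of the stand-in strings. If the count is 0 there while the browser inspector shows the element, the XPath copied from the browser does not apply to the raw download.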

2 Answers

0

Try //div[@class='article-text']/p/text()
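
A small sketch of why the `@` matters, run against a made-up snippet (the markup below is an assumption for illustration, not the real page): in XPath, `@class` tests the attribute, whereas a bare `class` is read as a child element named `class`.

```python
from lxml import html

# Hypothetical markup standing in for the article body.
snippet = ('<html><body><div class="article-text">'
           '<p>First paragraph.</p><p>Second paragraph.</p>'
           '</div></body></html>')
tree = html.fromstring(snippet)

# With @: matches the div whose class attribute equals 'article-text'.
with_at = tree.xpath("//div[@class='article-text']/p/text()")

# Without @: looks for a <class> child element inside the div, finds none.
without_at = tree.xpath("//div[class='article-text']/p/text()")

print(with_at)     # ['First paragraph.', 'Second paragraph.']
print(without_at)  # []
```

Both expressions are valid XPath, so the version without `@` raises no error; it just silently returns an empty list.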

Brent D
0

You can use the following XPath:

//div[@class='article-text']/p[1]/text()
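
One thing to watch when passing an expression like this to `tree.xpath(...)`: a path with no leading `//` is evaluated relative to the node you call it on. A minimal sketch on made-up markup (an assumption, not the actual page):

```python
from lxml import html

# Hypothetical markup; the div is nested inside <body>, not directly under <html>.
snippet = ('<html><body><div class="article-text">'
           '<p>First paragraph.</p><p>Second paragraph.</p>'
           '</div></body></html>')
tree = html.fromstring(snippet)  # tree is the <html> element

# Relative path: only looks at direct <div> children of <html>.
relative = tree.xpath("div[@class='article-text']/p[1]/text()")

# Leading //: searches the whole document at any depth.
anywhere = tree.xpath("//div[@class='article-text']/p[1]/text()")

print(relative)  # []
print(anywhere)  # ['First paragraph.']
```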
Soner Gönül
Piyush