How to grab data using XPath on javascript websites?

Question

I would like to grab data from this news site: http://www.inquirer.net/

I want to grab the news titles on the tiles.

Here's a screenshot of the inspected code.

As you can see, the title of one of the tiles that I want to grab is already there. When I copy the XPath from the browser, it gives //*[@id="tgs3_info"]/h2

I tried running this Python code:

import lxml.html
import lxml.etree
import requests

link = 'http://www.inquirer.net/'
res = requests.get(link)
r = res.content
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)

but it returns an empty list.

I searched for an answer here on Stack Overflow and elsewhere on the internet, but I don't really get it. When you view the page source of the site, the data that I want is not inside a JavaScript function. It is in a div, so I don't understand why I can't grab it. I hope I can find an answer here.

Check whether the value of `r` contains the required data. – Van Peer, Oct 26 '17 at 05:17
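
The check suggested in this comment could be sketched roughly as follows. This is only an illustration: the helper name `contains_marker` and the two sample byte strings are assumptions made up for the example, not part of the thread. With the question's code, `raw` would be `res.content`.

```python
def contains_marker(raw: bytes, marker: str) -> bool:
    """Return True if marker occurs anywhere in the raw response body."""
    return marker.encode("utf-8") in raw


# Hypothetical sample bodies standing in for res.content.
page_with_tile = b'<div id="tgs3_info"><h2>Some headline</h2></div>'
error_page = b"<html><body>403 Forbidden</body></html>"

print(contains_marker(page_with_tile, 'id="tgs3_info"'))  # True
print(contains_marker(error_page, 'id="tgs3_info"'))      # False
```

If the check gives False for the real response, the server did not send the expected markup (for example, it returned an error page instead), so the XPath has nothing to match.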

Van Peer · Accepted Answer · 2017-10-26T08:48:28.810

0

With input from Xurasky's solution, which avoids a 403 error:

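import lxml.html
from urllib.request import Request, urlopen

# Send a browser-like User-Agent header; this is what avoids the 403 error.
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()  # raw HTML as bytes
html_content = lxml.html.fromstring(r)
# Select every <h2> directly under any element with id="tgs3_info".
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
for a in root:
    print(a.text_content())  # text of the element and its descendants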

Output

Duterte, Roque meeting set in Malacañang
2 senators welcome Ventura's revelations in Atio hazing case
Paolo Duterte vows to retire from politics in 2019
NBA: DeMarcus Cousins regrets being loyal to Sacramento Kings
PH bet Elizabeth Durado Clenci wins 2nd runner-up at Miss Grand International 2017
DOJ wants Divina, 50 others in `Atio' hazing case added on BI watchlist
Georgina Wilson Shares Messages From Fans on Baby Blues
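
As a rough illustration of what the Request line in the code above does, the sketch below builds the same request without sending it and inspects the header. It uses only the standard library, and the variable names are illustrative. Note that urllib stores header names capitalized, so the key reads 'User-agent'.

```python
from urllib.request import Request, build_opener

url = 'http://www.inquirer.net/'

# Without an explicit header, urllib's opener supplies its own default
# User-agent of the form "Python-urllib/<version>".
default_headers = dict(build_opener().addheaders)
print(default_headers['User-agent'])

# Passing headers= sets a browser-like value that takes precedence over
# the default. Nothing is sent until urlopen(req) is called.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_header('User-agent'))  # Mozilla/5.0
print(req.get_method())              # GET
```

Some servers reject requests that carry a library's default User-Agent, which is presumably why this version gets the page while the plain request does not.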

edited Oct 26 '17 at 08:48

answered Oct 26 '17 at 05:37

Van Peer


Thank you for your answer. I tried to run your code, but it also returns an empty result. – Slet Oct 26 '17 at 07:33
`r` does contain data. I've tried Xurasky's solution and it still returns an empty result. – Slet Oct 26 '17 at 08:02
Your updated post is what I actually needed. Can you please explain how it works, especially these parts: `from urllib.request import Request, urlopen`, `req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})` and `r = urlopen(req).read()`? – Slet Oct 26 '17 at 08:54
My `r` is returning data all right. That's why I would like to ask you to explain the lines `from urllib.request import Request, urlopen`, `req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})` and `r = urlopen(req).read()`. – Slet Oct 26 '17 at 11:16
@Slet, can you please check what `r` is returning in your original code? Also, for what urllib does, you can read https://stackoverflow.com/a/2018074/1836483 – Van Peer Oct 26 '17 at 11:47

score 0 · Answer 2 · answered Oct 26 '17 at 05:42

I believe you are getting a `urllib.error.HTTPError: HTTP Error 403: Forbidden` error.

You can fix this by sending a User-Agent header with the request:

import lxml.html
import lxml.etree
from urllib.request import Request, urlopen

req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
