I would like to scrape data from this news site: http://www.inquirer.net/

I want to grab the news titles shown on the tiles.

Here's the screen shot of the inspected code

As you can see, the title of one of the tiles that I want to grab is already there. When I copy the XPath from the browser, I get //*[@id="tgs3_info"]/h2

I tried to run my Python code:

import lxml.html
import lxml.etree
import requests

link = 'http://www.inquirer.net/'
res = requests.get(link)
r = res.content
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)

but it returns an empty list.

I searched for an answer here on Stack Overflow and elsewhere on the internet, but I still don't understand it. When you view the page source of the site, the data that I want is not generated by a JavaScript function. It is in the div itself, so I don't understand why I can't grab it. I hope I can find an answer here.

Slet

2 Answers


With input from Xurasky's solution, which sets a User-Agent header to avoid a 403 error:

import lxml.html
import lxml.etree
from urllib.request import Request, urlopen

req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
r = urlopen(req).read()
html_content = lxml.html.fromstring(r)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
for a in root:
    print(a.text_content())

Output

Duterte, Roque meeting set in Malacañang
2 senators welcome Ventura's revelations in Atio hazing case
Paolo Duterte vows to retire from politics in 2019
NBA: DeMarcus Cousins regrets being loyal to Sacramento Kings
PH bet Elizabeth Durado Clenci wins 2nd runner-up at Miss Grand International 2017
DOJ wants Divina, 50 others in `Atio' hazing case added on BI watchlist
Georgina Wilson Shares Messages From Fans on Baby Blues
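
A short sketch of what the Request lines do, in case it helps. By default urllib identifies itself as Python-urllib/<version>, and requests similarly sends python-requests/<version>. My assumption, which I have not verified against the site's configuration, is that the server rejects those default values, so passing a browser-like User-Agent is what lets the request through. Nothing below touches the network; it only builds the request and inspects the headers.

```python
from urllib.request import Request, build_opener

# Build the request object; nothing is sent until urlopen(req) is called.
req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})

# urllib stores header names in capitalized form, i.e. 'User-agent'.
custom_agent = req.get_header('User-agent')
print(custom_agent)

# The default headers urllib would send if none were supplied.
default_headers = build_opener().addheaders
print(default_headers)
```

Calling urlopen(req).read() then sends the request with that header and returns the response body as bytes, which is what gets passed to lxml.html.fromstring.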
Van Peer
  • Thank you for your answer. I tried to run your code, but it also returned an empty result. – Slet Oct 26 '17 at 07:33
  • r does contain data. I've tried Xurasky's solution and it still returns an empty result. – Slet Oct 26 '17 at 08:02
  • Your updated post is what I actually needed. Can you please explain how it works, especially this part? from urllib.request import Request, urlopen req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'}) r = urlopen(req).read() – Slet Oct 26 '17 at 08:54
  • My r is returning data correctly. That's why I would like to ask you to explain these lines: from urllib.request import Request, urlopen req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'}) r = urlopen(req).read() – Slet Oct 26 '17 at 11:16
  • @Slet can you please check what `r` is returning in your original code? also, what urllib does, you can read this https://stackoverflow.com/a/2018074/1836483 – Van Peer Oct 26 '17 at 11:47

I believe you are getting a 403 error (urllib.error.HTTPError: HTTP Error 403: Forbidden).

You can fix this by sending a User-Agent header with the request:

import lxml.html
import lxml.etree
from urllib.request import Request, urlopen

req = Request('http://www.inquirer.net/', headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req).read()
html_content = lxml.html.fromstring(res)
root = html_content.xpath('//*[@id="tgs3_info"]/h2')
print(root)
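
If you want to confirm whether a 403 is really the cause, a slightly more defensive version of the same request can report the status code instead of failing silently. This is only a sketch: fetch_html is a hypothetical helper name of my own, and the 'Mozilla/5.0' default is just the value used above.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_html(url, user_agent='Mozilla/5.0'):
    """Hypothetical helper: return the response body as bytes, or None on an HTTP error."""
    req = Request(url, headers={'User-Agent': user_agent})
    try:
        with urlopen(req) as response:
            return response.read()
    except HTTPError as err:
        # err.code holds the HTTP status, e.g. 403 for Forbidden.
        print('Request failed with HTTP status', err.code)
        return None
```

You would call it as fetch_html('http://www.inquirer.net/') and pass the result to lxml.html.fromstring only when it is not None.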
dreadera