
I wrote Python code to scrape the content of news articles found by searching keywords on Google News.

def __init__(self, term):
    self.term = term
    self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)
    response = requests.get(self.url)

This code only gets the contents of the first page of results for the keywords. How can I change it to get the second, third, or later pages?

David Guo

1 Answer


You can do this by appending the &start= query parameter to the URL, with an integer that specifies the index of the first result the page should display.

For example, since a default page shows 10 results, using

self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws&start=10'.format(self.term)

will show you the second page.

So, the generalized result could be something similar to this (you can also modify it in order to change pages after every scrape):

def __init__(self, term, page):
    self.term = term
    self.subjectivity = 0
    self.sentiment = 0
    self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws&start={1}'.format(self.term, page * 10)
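As a rough sketch of the same idea outside the class, the URL for any page could be built with a small helper. The name build_news_url and the constant RESULTS_PER_PAGE are hypothetical, and the page size of 10 is the assumption stated above. Using urlencode also escapes search terms that contain spaces, which plain string formatting does not do.

```python
from urllib.parse import urlencode

# Assumption from the answer: a default results page shows 10 items.
RESULTS_PER_PAGE = 10


def build_news_url(term, page=0):
    """Return the Google News search URL for a zero-based page number.

    build_news_url is a hypothetical helper, not part of any library.
    """
    params = {
        "q": term,
        "source": "lnms",
        "tbm": "nws",
        "start": page * RESULTS_PER_PAGE,  # index of the first result on the page
    }
    return "https://www.google.com/search?" + urlencode(params)


# Print the URLs for the first three pages of results.
for page in range(3):
    print(build_news_url("climate change", page))
```

Page 0 is the first page here, so page 1 gives start=10, matching the example URL above.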
  • Thanks a lot, your code is awesome. Actually, I'm writing code for sentiment analysis. When I use it to fetch the Google pages at the URLs generated by that line, Google responds with the note below: About this page

    Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. Is there a way to solve the problem?
    – David Guo Apr 06 '19 at 18:53
  • You are triggering a CAPTCHA. You could send fewer requests per unit of time, or use a search API that conforms to their terms of service. I would try a Google API, or another search engine's API if you are not explicitly required to use Google. Check this: https://stackoverflow.com/questions/2445308/is-there-a-way-to-programmatically-access-googles-search-engine-results and https://developers.google.com/custom-search/v1/overview – Pentracchiano Apr 06 '19 at 20:59
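
One way to do fewer requests per time unit is to pause between page fetches. The sketch below is only an illustration: fetch_pages is a hypothetical helper, the 5-second default is an arbitrary guess, and nothing here guarantees the CAPTCHA will not appear or that the traffic conforms to the Terms of Service.

```python
import time


def fetch_pages(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch_pages is a hypothetical helper. `fetch` is any callable that takes
    a URL (for example requests.get), and `sleep` can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Possible usage with the requests library (assumed to be installed):
#     import requests
#     responses = fetch_pages(list_of_page_urls, requests.get)
```

Passing the fetch function in as an argument keeps the pacing logic separate from the HTTP library, so the same loop works with requests.get or with an API client.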