2

I wrote code that scrapes Google News search results, but it always scrapes just the first page. How do I write a loop that lets me scrape the first 2, 3, ... n pages?

I know that I need to add a page parameter to the URL and put everything in a for loop, but I don't know how.

This code gives me the headlines, snippets and dates of the first results page:

from bs4 import BeautifulSoup
import requests

# Browser-like User-Agent so Google doesn't serve a stripped-down page
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
# I know I need to add a page parameter here, but I don't know how
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

headline_text = soup.find_all('h3', class_="r dO0Ag")  # headlines
snippet_text = soup.find_all('div', class_='st')       # snippet paragraphs
news_date = soup.find_all('div', class_='slp')         # dates
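
To check the output, the matched nodes can be printed together like this (a minimal sketch, assuming the three lists line up one entry per result):

for headline, snippet, date in zip(headline_text, snippet_text, news_date):
    print(headline.get_text())
    print(snippet.get_text())
    print(date.get_text())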

Also, can this pagination logic for Google News be applied to, for example, Bing News or Yahoo News? I mean, can I use the same parameter, or is the URL different?

taga
  • Be careful, because Google has some powerful anti-scraping measures and you might get blocked. If you don't want to develop a very safe scraper (IP rotation, human movement replication, etc.), you might consider using one of Google's APIs to get your data – Juan C Nov 20 '19 at 14:07
  • You can do `url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws&page={1}'.format(term,page)`. Take a look at https://stackoverflow.com/questions/38635419/searching-in-google-with-python – AlexDotis Nov 20 '19 at 14:34
  • I have tried that, but it always returns the first page; whatever number I put for page, it returns the content of page 1 – taga Nov 20 '19 at 14:39

2 Answers

4

I think you need to change your URL: Google pages its results with the `start` parameter, which moves in steps of 10 (start=0 is page 1, start=10 is page 2, and so on). Try the code below and see if it works.

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
page = 0  # Google's 'start' offset: 0 is page 1, 10 is page 2, and so on

while True:
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)

    response = requests.get(url, headers=headers, verify=False)  # verify=False skips TLS certificate checks
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')

    headline_text = soup.find_all('h3', class_="r dO0Ag")
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')

    # Google answers 200 even past the last page, so also stop once a page
    # comes back with no results
    if not headline_text:
        break

    page = page + 10  # move 'start' to the next page of 10 results
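
The session-specific parameters in that URL (sxsrf, ei, ved and the rest) only come from copying the address out of a live browser; they are not what drives the pagination. A shorter URL will likely work just as well; here is a minimal sketch assuming only q, tbm=nws and start are actually required:

# Assumption: the browser tracking parameters can be dropped and only
# q, tbm=nws and start matter for pagination
base_url = 'https://www.google.com/search?q={}&tbm=nws&start={}'
for start in range(0, 30, 10):  # first three pages: start = 0, 10, 20
    print(base_url.format(term, start))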
KunduK
  • It works now, but is there a way to do it without such a long URL? I mean, if I want only page 1, the URL is 3 times shorter. Also, how would this URL look for Yahoo and Bing? – taga Nov 20 '19 at 15:07
  • Then requests might not help you; you'd have to use a browser tool like Selenium WebDriver and click each pagination link to get the new page value. – KunduK Nov 20 '19 at 15:11
  • @taga: If you have done new research and found an issue, please post a new question and mention what you are after. If not me, some other contributor will definitely help you out. Thanks. – KunduK Nov 20 '19 at 15:18
  • Hey, can you help me with this? https://stackoverflow.com/questions/59047342/generating-url-for-yahoo-and-bing-scrapping-for-multiple-pages-with-python-and-b – taga Nov 26 '19 at 09:57
1

Code and full example in the online IDE to test out:

from bs4 import BeautifulSoup
import requests, urllib.parse

def paginate(url, previous_url=None):
    # Break from infinite recursion
    if url == previous_url: return

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # First page
    yield soup

    next_page_node = soup.select_one('a#pnnext')

    # Stop when there is no next page
    if next_page_node is None: return

    next_page_url = urllib.parse.urljoin('https://www.google.com/',
                                         next_page_node['href'])

    # Pages after the first one
    yield from paginate(next_page_url, url)


def scrape():
    pages = paginate(
        "https://www.google.com/search?hl=en-US&q=coca+cola&tbm=nws")

    for soup in pages:
        print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
        print()

        for data in soup.find_all('div', class_='dbsr'):
            title = data.find('div', class_='JheGif nDgy9d').text
            link = data.a['href']

            print(f'Title: {title}')
            print(f'Link: {link}')
            print()
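
Note that nothing runs until scrape() is called, so finish the script with:

if __name__ == '__main__':
    scrape()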


Alternatively, you can achieve the same thing by using the Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that it supports multiple search engines, and the setup is fast and straightforward. You don't have to maintain the parser, figure out how to bypass blocks from Google or other engines, or work out how to extract certain elements, since all of that is already done for the end user.

Code to integrate:

# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os

def scrape():
  params = {
    "engine": "google",
    "q": "gta san andreas",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  pages = search.pagination()

  for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}")

    for news_result in result["news_results"]:
        print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

P.S. I wrote a more detailed blog post about how to scrape Google News.

Disclaimer, I work for SerpApi.

Dmitriy Zub