I'm writing some code to scrape information from Google search results. Ideally, I would like to be able to scrape around 1000 results (URLs) based on a set of keywords. Now, I'm getting regularly blocked (temporarily) for sending too many requests. This usually happens after around 300 results (titles/URLs) have been scraped from the Google search results. Lately, I've been reading a lot on how to circumvent this, but without much luck.

Now, I would like to obtain the maximum number of results without getting blocked. Specifically, I would like to reduce the number of requests I'm sending. This would ease the pressure on the site and therefore reduce the likelihood of getting blocked. However, I don't know how to implement this while opening the pages with urllib.request's Request and urlopen().read(). Any help on this? Btw, I don't mind obtaining the 1000 URLs over the course of, say, 3-4 hours. Code below:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import random
import time
from time import sleep

user_agent_list = [    
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.56',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    ] 

# Scrape URLs based on Google keyword combinations
root = "https://www.google.com/"
url  = "https://google.com/search?q="

csv_fn = 'export.csv'

def news(link, n_pages=5, page_count=0):
    print(page_count)
    
    sleep(random.uniform(5, 10))
    
    # Insert a session somewhere here?
 
    req = Request(link, headers={"User-Agent": 'Mozilla/5.0'})

    webpage = urlopen(req).read()

    soup = BeautifulSoup(webpage, 'html5lib')

    for item in soup.find_all('div', attrs = {'class': 'ZINbbc xpd O9g5cc uUPGi'}): # 'kCrYT' for larger link string
        start_time = time.time()
        
        try:
            title = (item.find('div', attrs= {'class': 'BNeawe vvjwJb AP7Wnd'}).get_text())
        except:
            print('No title found')
            continue
        
        # Export to CSV
        document = open(csv_fn, "a", encoding='utf-8')
        document.write("{}\n".format(title))
        document.close()

    nextPage = soup.find('a', attrs = {'aria-label': "Volgende pagina"}) # Or 'Next'
    if nextPage is None:
        nextPage = soup.find('a', attrs = {'aria-label': "Next"})

    # Only continue to next page if present!
    page_count += 1
    if nextPage is not None and page_count < n_pages:
        nextPage = (nextPage['href'])
        link = root + nextPage
        news(link, n_pages=n_pages, page_count=page_count)
    elif page_count >= n_pages:
        print(f'Scraped {n_pages} pages, done!')
    else:
        print('No next page found, terminating scraping...')

link = url + 'whale+watching+orca+iceland'
news(link, n_pages = 5)

Btw, other tips to prevent Google from blocking me are all very welcome, although it seems that tips such as IP rotation, for free, are a bit too much to ask when scraping Google results.
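
To make it a bit more concrete, below is roughly what I have in mind for the request part: a much longer randomized delay, and rotating through the (currently unused) user_agent_list. The fetch_page name and the delay values are just guesses on my part, and I have no idea whether this spacing is actually enough to stay under Google's limits:

def fetch_page(link, min_delay=90, max_delay=150):
    # ~2 minutes per page on average, so ~100 pages (roughly 1000 results
    # at ~10 per page) would take a bit over 3 hours, as mentioned above
    sleep(random.uniform(min_delay, max_delay))

    # Rotate the User-Agent instead of hard-coding 'Mozilla/5.0'
    req = Request(link, headers={"User-Agent": random.choice(user_agent_list)})
    return urlopen(req).read()

In news(), the sleep, Request and urlopen lines would then collapse into webpage = fetch_page(link).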

CrossLord
  • Please remove the "nlp" tag and instead add the "beautiful-soup" and "urllib" tags. Further, is there any reason for using urllib and not using https://docs.python-requests.org/en/master/index.html? – Ganesh Tata May 18 '21 at 13:32
  • You're already using `sleep` in your code, if you want to wait longer between requests simply increase the sleep time. – l4mpi May 18 '21 at 13:35
  • @l4mpi Yes, I'm aware that I'm using sleep; however, even with the sleep command I get a "Too many Requests" error after ~300 items. So sleep alone is not cutting it. – CrossLord May 18 '21 at 15:07
  • @GaneshTata Well, the reason I'm not using requests is that I could not find an effective way to scrape the titles from each subsequent Google results page. If you have any example/link that shows this, I would love to check it out (I've put a rough sketch of what I mean below these comments). – CrossLord May 18 '21 at 15:15
  • @CrossLord. If the Selenium answer below is not working for your purposes, please share why it is insufficient. – deseuler May 18 '21 at 15:34
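
For reference, this is a rough sketch of what I think the requests suggestion would look like while keeping my current selectors and next-page logic (the news_requests name is just a placeholder, and I haven't verified whether a Session changes how quickly Google blocks me):

import random
import requests
from time import sleep
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({"User-Agent": 'Mozilla/5.0'})

def news_requests(link, n_pages=5):
    root = "https://www.google.com/"
    for _ in range(n_pages):
        soup = BeautifulSoup(session.get(link).text, 'html5lib')
        for item in soup.find_all('div', attrs={'class': 'ZINbbc xpd O9g5cc uUPGi'}):
            title_div = item.find('div', attrs={'class': 'BNeawe vvjwJb AP7Wnd'})
            if title_div is not None:
                print(title_div.get_text())
        # Same next-page lookup as in my code above
        nextPage = soup.find('a', attrs={'aria-label': 'Volgende pagina'}) or \
                   soup.find('a', attrs={'aria-label': 'Next'})
        if nextPage is None:
            break
        link = root + nextPage['href']
        sleep(random.uniform(5, 10))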

1 Answer

I would use Selenium for this. It lets you automate Google Chrome and executes the page's JavaScript, making the solution less reliant on parsing raw HTML.

I also tested this method with the query you provided; it ran out of pages after 25 pages, or ~250 titles.

I am not entirely certain how you want your CSV structured, so I'll just show you how to get the news titles.

import time
import random
from selenium import webdriver
import selenium
from webdriver_manager.chrome import ChromeDriverManager

csv_fn = 'export.csv'
driver = webdriver.Chrome(ChromeDriverManager().install()) #Get Automated Google Chrome Drivers
driver.maximize_window()

def google_search(query):
    driver.get('http://www.google.com/')
    search_box = driver.find_element_by_name('q') #Search bar
    search_box.send_keys(query) #Type query
    search_box.submit()         #Hit Enter
    time.sleep(2)

    #Find news tab
    tabs = driver.find_elements_by_class_name("hide-focus-ring")
    for tab in tabs:
        if tab.text == 'News':
            news_tab = tab
            break
    driver.get(news_tab.get_attribute('href')) #Go to news tab

query = 'whale watching orca iceland'
google_search(query)
num_pages = 0
while num_pages < 200:
    try:
        time.sleep(random.uniform(5, 10))

        #Perform title extraction logic here
        headings = driver.find_elements_by_xpath("//div[@role = 'heading']") #Heading elements
        for heading in headings:
            title = heading.text
            #Do whatever with the title
            print(title)

        #Go to next page
        next = driver.find_element_by_id('pnnext') #Next page
        next.click()
    except selenium.common.exceptions.NoSuchElementException: #No more pages
        break

    num_pages += 1

driver.close()
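
If you do want to mirror the CSV export from your original script, here is a minimal sketch (assuming one title per line in csv_fn, like your urllib version) that could replace the print(title) call:

# inside "for heading in headings", instead of print(title):
with open(csv_fn, 'a', encoding='utf-8') as document:
    document.write("{}\n".format(title))
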
deseuler
  • Thanks a lot for your input, deseuler. However, I'm getting this error when running your script: ElementClickInterceptedException: element click intercepted: Element ... is not clickable at point (491, 107). Other element would receive the click: – CrossLord May 18 '21 at 17:51
  • It seems like something is obscuring the News tab in the Chrome browser. I updated the code to be a little more generalized in finding the News tab, and to navigate to it without using the click method. Let me know if this change fixes it. – deseuler May 18 '21 at 18:15
  • Thanks again. The previous error is resolved, but now I'm bumping into this error: "UnboundLocalError: local variable 'news_tab' referenced before assignment". – CrossLord May 19 '21 at 06:23
  • That means the News tab is not showing up in your driver. I can't reproduce this issue. – deseuler May 19 '21 at 13:14