I'm writing some code to scrape information from Google search results. Ideally, I would like to be able to scrape around 1000 results (URLs) based on a set of keywords. At the moment I regularly get blocked (temporarily) for sending too many requests, usually after around 300 results (titles/URLs) have been scraped from the Google Search results. Lately I've been reading a lot on how to circumvent this, but without much luck.
Now, I would like to obtain the maximum number of results without getting blocked. Specifically, I would like to reduce the number of requests I'm sending, which would ease the load on Google and therefore reduce the likelihood of getting blocked. However, I do not know how to implement this while opening the pages with urllib.request.Request and urlopen(...).read(). Any help with this? By the way, I don't mind obtaining the 1000 URLs over the course of, say, 3-4 hours (I've put a rough sketch of what I have in mind right after the code). Code below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import random
from time import sleep
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.56',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
]
# Scrape URLs based on Google keyword combinations
root = "https://www.google.com/"
url = "https://google.com/search?q="
csv_fn = 'export.csv'
def news(link, n_pages=5, page_count=0):
    print(page_count)
    sleep(random.uniform(5, 10))  # random pause before fetching each results page
    # Insert a session somewhere here?
    req = Request(link, headers={"User-Agent": random.choice(user_agent_list)})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, 'html5lib')
    for item in soup.find_all('div', attrs={'class': 'ZINbbc xpd O9g5cc uUPGi'}):  # 'kCrYT' for larger link string
        try:
            title = item.find('div', attrs={'class': 'BNeawe vvjwJb AP7Wnd'}).get_text()
        except AttributeError:
            print('No title found')
            continue
        # Export to CSV
        with open(csv_fn, "a", encoding='utf-8') as document:
            document.write("{}\n".format(title))
    nextPage = soup.find('a', attrs={'aria-label': "Volgende pagina"})  # Dutch UI; falls back to 'Next' below
    if nextPage is None:
        nextPage = soup.find('a', attrs={'aria-label': "Next"})
    # Only continue to the next page if one is present!
    page_count += 1
    if nextPage is not None and page_count < n_pages:
        link = root + nextPage['href']
        news(link, n_pages=n_pages, page_count=page_count)
    elif page_count >= n_pages:
        print(f'Scraped {n_pages} pages, done!')
    else:
        print('No next page found, terminating scraping...')

link = url + 'whale+watching+orca+iceland'
news(link, n_pages=5)
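For what it's worth, this is roughly what I had in mind for the session idea: reuse a single opener with a cookie jar for the whole run, pick a User-Agent from the list above for every request, and space the requests out much further. It's an untested sketch meant to slot into the script above (the fetch name and the 20-40 second delay range are just my guesses), and I'm not sure whether keeping cookies actually helps or hurts here:

from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

# One opener for the whole run, so cookies persist across requests
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

def fetch(link):
    # Long, randomised pauses: ~1000 URLs over 3-4 hours instead of minutes
    sleep(random.uniform(20, 40))
    # Rotate the User-Agent per request, reusing user_agent_list from above
    req = Request(link, headers={"User-Agent": random.choice(user_agent_list)})
    return opener.open(req).read()

# ...and then in news(): webpage = fetch(link) instead of building req/urlopen there

Related question: would it also make sense to request more results per page, e.g. by appending '&num=100' to the search URL, so that 1000 URLs take roughly 10 requests instead of ~100? I'm not sure whether Google still honours that parameter, or whether the CSS classes I parse change with it.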
Btw, any other tips to prevent Google from blocking me are very welcome, although things like IP rotation seem a bit too much to ask for free when scraping Google results.
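In case it helps: the only fallback I've come up with so far is to back off and retry whenever Google answers with HTTP 429 (Too Many Requests), which I think is what the temporary block is, rather than keep hammering. Again an untested sketch that reuses the imports and user_agent_list from the script above; the 60-second starting delay and the four retries are arbitrary numbers I made up:

from urllib.error import HTTPError

def fetch_with_backoff(link, max_retries=4):
    delay = 60  # initial pause after the first rate-limit response
    for attempt in range(max_retries):
        try:
            req = Request(link, headers={"User-Agent": random.choice(user_agent_list)})
            return urlopen(req).read()
        except HTTPError as e:
            if e.code != 429:
                raise  # only retry when it is rate limiting
            print(f'Rate limited, sleeping {delay}s (attempt {attempt + 1} of {max_retries})')
            sleep(delay + random.uniform(0, 30))  # jitter so retries are not evenly spaced
            delay *= 2  # double the wait each time
    raise RuntimeError(f'Still blocked after {max_retries} retries')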