You can extract data from Google Search without an API; the BeautifulSoup web-scraping library is enough. Keep in mind that you need to take care of CAPTCHAs and IP rate limits yourself, which can be done by rotating proxies and user agents.
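A minimal sketch of such rotation with requests (the user-agent strings and proxy URLs below are placeholders for illustration, not working values):

import random
import requests

# Placeholder pools -- replace with your own real user agents and proxies.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]
proxies = [
    "http://user:pass@proxy-1.example.com:8080",  # hypothetical proxy
    "http://user:pass@proxy-2.example.com:8080",  # hypothetical proxy
]

def fetch(url, params):
    proxy = random.choice(proxies)  # pick a different proxy per request
    return requests.get(
        url,
        params=params,
        headers={"User-Agent": random.choice(user_agents)},  # pick a different user agent per request
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )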
You can search for elements on a page using CSS selectors.
To find CSS selectors you can use the SelectorGadget Chrome extension, which lets you click on the desired element in your browser and returns the corresponding CSS selector (it doesn't always work perfectly if the website is rendered via JavaScript).
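For example, BeautifulSoup's select() and select_one() accept such selectors directly (.tF2Cxc and .DKV0Md are the result-container and title selectors used in the full script below; note that Google changes its class names from time to time):

from bs4 import BeautifulSoup

html = '<div class="tF2Cxc"><h3 class="DKV0Md">Some title</h3></div>'
soup = BeautifulSoup(html, "lxml")

for result in soup.select(".tF2Cxc"):          # every element matching the selector
    print(result.select_one(".DKV0Md").text)   # first match inside each result -> "Some title"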
It is also possible to dynamically extract results from all pages using non-token-based pagination, which will go through every page no matter how many there are.
You can add several conditions for exiting the loop, such as hitting a page limit or the absence of a "next page" button:
if page_num == page_limit:  # exit by page limit
    break

if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10   # go to the next page
else:
    break                   # exit if there is no "next page" button
Check the full code with pagination in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "SmartyKat Catnip Cat Toys",  # search query
    "hl": "en",                        # language
    "gl": "uk",                        # country of the search, uk -> United Kingdom
    "start": 0,                        # results offset, 0 by default (first page)
    # "num": 100                       # parameter defines the maximum number of results to return
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
page_limit = 5
page_num = 0
data = []
while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # some results have no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    if page_num == page_limit:  # exit by page limit
        break

    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10   # go to the next page
    else:
        break                   # exit if there is no "next page" button
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "SmartyKat Catnip Chase Cat Toy - I Love My Pets",
    "snippet": "Catnip Chase™ compressed catnip toy Play SmartyKat offers a variety of toys to meet a cat's need for hunting, exercise, excitement, interaction, ...",
    "links": "https://www.ilovemypets.ph/index.php?route=product/product&product_id=1670"
  },
  {
    "title": "Kitties & Their Humans - Facebook",
    "snippet": "5 IN STOCK* SmartyKat Catnip Cat Toys Brand: SmartyKat Style: Madcap Mania™ Refillable Assorted Mice Catnip Cat Toy Style: Mice (Random Selection)...",
    "links": "https://m.facebook.com/2674028906242223/"
  },
  ... other results
]
Alternatively, you can use a third-party API such as the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there's no need to create a parser and maintain it.
Example SerpApi code with pagination:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
    "api_key": "...",                  # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",                # serpapi parser engine
    "q": "SmartyKat Catnip Cat Toys",  # search query
    "num": "100"                       # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}
search = GoogleSearch(params) # where data extraction happens
organic_results_data = []
page_num = 0
while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # update the search params with the next page's query string, or stop if there is no next page
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output: the same as in the bs4 solution.