You can extract Google Images data with regular expressions: the results are rendered dynamically, but the data you need is also present in the page as inline JSON. This is a faster method than using browser automation.
To do that, we can search the page source (Ctrl+U) for the title of the first image. If there are matches inside <script> elements, the data is most likely inline JSON, and we can extract it from there.
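The check described above can also be done programmatically. This is only a minimal sketch using the standard library: the sample HTML and the `title_in_script` helper name are made up for illustration, and on a real page you would pass the downloaded page source and the title of the first image you see in the browser.

```python
import re


def title_in_script(page_source: str, title: str) -> bool:
    """Return True if `title` occurs inside any <script> element."""
    # Non-greedy match of each script body; DOTALL lets "." span newlines.
    script_bodies = re.findall(
        r"<script\b[^>]*>(.*?)</script>",
        page_source,
        flags=re.DOTALL | re.IGNORECASE,
    )
    return any(title in body for body in script_bodies)


# Hypothetical, heavily simplified page source (not real Google output).
sample_html = """
<html><body>
<div>static markup without the image title</div>
<script>AF_initDataCallback({data: [["Eiffel Tower - Wikipedia"]]});</script>
</body></html>
"""

print(title_in_script(sample_html, "Eiffel Tower - Wikipedia"))  # True
print(title_in_script(sample_html, "Some other title"))          # False
```

If the function returns True for a title that is visible on the page, the data is probably embedded as inline JSON and is worth extracting with a regular expression.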
To find the original images, we first need to find the thumbnails. We then remove the matched thumbnails from the parsed inline JSON, which makes the original-resolution images easier to match:
# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
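To see why removing the thumbnails first helps, here is a small sketch that runs the same two patterns on a toy string. The `toy_data` fragment, the `example.com` URL and the `decode_twice` helper are assumptions made for illustration. The real inline JSON is much larger and its exact shape may differ.

```python
import re

# Same patterns as in the snippet above.
THUMBNAIL_PATTERN = r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]'
FULL_RES_PATTERN = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"

# Hypothetical fragment (an assumption, not real Google output): one thumbnail
# entry followed by one full-resolution entry, with doubly escaped "=" and "&".
toy_data = (
    r'[0,"id1",'
    r'["https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:abc\\u0026usqp\\u003dCAU",183,275],'
    r'["https://example.com/full.jpg",1200,800]]'
)


def decode_twice(value: str) -> str:
    # First pass turns each doubled backslash into a single one,
    # second pass turns the resulting \uXXXX sequences into characters.
    once = bytes(value, "ascii").decode("unicode-escape")
    return bytes(once, "ascii").decode("unicode-escape")


# Step 1: collect the thumbnails.
thumbnails = [decode_twice(url) for url in re.findall(THUMBNAIL_PATTERN, toy_data)]

# Step 2: remove them, which leaves ",," right before the full-resolution entry.
without_thumbnails = re.sub(THUMBNAIL_PATTERN, "", toy_data)

# Step 3: match the full-resolution images in what is left.
full_res_images = [decode_twice(url) for url in re.findall(FULL_RES_PATTERN, without_thumbnails)]

print(thumbnails)       # ['https://encrypted-tbn0.gstatic.com/images?q=tbn:abc&usqp=CAU']
print(full_res_images)  # ['https://example.com/full.jpg']
```

In this toy string the full-resolution pattern finds nothing until the thumbnail entry has been removed, because it expects a comma (or a quote) directly before the `,[` that opens the entry.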
Unfortunately, this method cannot find every image, because more images are added to the page as you scroll. If you need to collect all of them and don't want to reverse engineer the requests, you need to use browser automation, such as selenium or playwright.
There's an "ijn" URL parameter that defines the page number to get (greater than or equal to 0). It is used in combination with a pagination token that is also located in the inline JSON.
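As a rough illustration of the "ijn" parameter, the sketch below only builds the request URLs for a few pages. The `build_image_search_url` helper is a made-up name, and the pagination token is deliberately left out because its name and location are not covered here, so these URLs alone may not be enough to get later pages.

```python
from urllib.parse import urlencode


def build_image_search_url(query: str, page: int) -> str:
    """Build a Google Images search URL for a zero-based page number."""
    if page < 0:
        raise ValueError("ijn must be greater than or equal to 0")
    params = {
        "q": query,      # search query
        "tbm": "isch",   # image results
        "hl": "en",      # language of the search
        "gl": "us",      # country where the search comes from
        "ijn": page,     # page number: 0 -> first page, 1 -> second...
    }
    return "https://google.com/search?" + urlencode(params)


for page in range(3):
    print(build_image_search_url("tower", page))
```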
Check the code in the online IDE.
import requests, re, json, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
google_images = []
params = {
    "q": "tower",   # search query
    "tbm": "isch",  # image results
    "hl": "en",     # language of the search
    "gl": "us"      # country where the search comes from
}
html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
all_script_tags = soup.select("script")
# https://regex101.com/r/RPIbXK/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/NRKEmV/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
# pair each result's metadata with its thumbnail and original image
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })
print(json.dumps(google_images, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Eiffel Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Eiffel_Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTsuYzf9os1Qb1ssPO6fWn-5Jm6ASDXAxUFYG6eJfvmehywH-tJEXDW0t7XLR3-i8cNd-0&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/85/Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg/640px-Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg"
  },
  {
    "title": "tower | architecture | Britannica",
    "link": "https://www.britannica.com/technology/tower",
    "source": "Encyclopedia Britannica",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8EsWofNiFTe6alwRlwXVR64RdWTG2fuBQ0z1FX4tg3HbL7Mxxvz6GnG1rGZQA8glVNA4&usqp=CAU",
    "original": "https://cdn.britannica.com/51/94351-050-86B70FE1/Leaning-Tower-of-Pisa-Italy.jpg"
  },
  {
    "title": "Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3L9LA0VamqmevhCtkrHZvM9MlBf9EjtTT7KhyzRP3zi3BmuCOmn0QFQG42xFfWljcsho&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/3/3e/Tokyo_Sky_Tree_2012.JPG"
  },
  # ...
]
You can also use the Google Images API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks from Google (including CAPTCHA), and there is no need to create the parser and maintain it.
Simple code example:
from serpapi import GoogleSearch
import os, json
image_results = []
# search query parameters
params = {
    "engine": "google",               # search engine. Google, Bing, Yahoo, Naver, Baidu...
    "q": "tower",                     # search query
    "tbm": "isch",                    # image results
    "num": "100",                     # number of images per page
    "ijn": 0,                         # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY")   # your serpapi api key
    # other query parameters: hl (lang), gl (country), etc
}
search = GoogleSearch(params) # where data extraction happens
images_is_present = True
while images_is_present:
    results = search.get_dict()       # JSON -> Python dictionary
    # checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                image_results.append(image["original"])
        # update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
        print(results["error"])
print(json.dumps(image_results, indent=2))
Output:
[
  "https://cdn.rt.emap.com/wp-content/uploads/sites/4/2022/08/10084135/shutterstock-woods-bagot-rough-site-for-leadenhall-tower.jpg",
  "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/1c/60/ff/c5/ambuluwawa-tower-is-the.jpg?w=1200&h=-1&s=1",
  "https://cdn11.bigcommerce.com/s-bf3bb/product_images/uploaded_images/find-your-nearest-cell-tower-in-five-minutes-or-less.jpeg",
  "https://s3.amazonaws.com/reuniontower/Reunion-Tower-Exterior-Skyline.jpg",
  "https://assets2.rockpapershotgun.com/minecraft-avengers-tower.jpg/BROK/resize/1920x1920%3E/format/jpg/quality/80/minecraft-avengers-tower.jpg",
  "https://images.adsttc.com/media/images/52ab/5834/e8e4/4e0f/3700/002e/large_jpg/PERTAMINA_1_Tower_from_Roundabout.jpg?1386960835",
  "https://awoiaf.westeros.org/images/7/78/The_tower_of_joy_by_henning.jpg",
  "https://eu-assets.simpleview-europe.com/plymouth2016/imageresizer/?image=%2Fdmsimgs%2Fsmeatontower3_606363908.PNG&action=ProductDetailNew",
  # ...
]
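One small design note on the loop above: the `not in image_results` check scans the whole list for every image, which gets slower as the list grows. A possible alternative, sketched here with made-up `example.com` URLs and a hypothetical `unique_in_order` helper, is to collect everything first and drop duplicates once while keeping the original order.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


# Hypothetical URLs, as they might be collected across several pages.
collected = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",
    "https://example.com/c.jpg",
]

print(unique_in_order(collected))
# ['https://example.com/a.jpg', 'https://example.com/b.jpg', 'https://example.com/c.jpg']
```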
There's a "Scrape and download Google Images with Python" blog post if you need a more detailed explanation of the code.