BeautifulSoup and urllib not parsing google images page

Question

I am trying to use BeautifulSoup to find a random image from google images. My code looks like this.

import urllib, bs4, random
from urllib import request
urlname = "https://www.google.com/search?hl=en&q=" + str(random.        randrange(999999))  + "&ion=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.  42553238,d.dmg&biw=1354&bih=622&um=1&ie=UTF-                            8&tbm=isch&source=og&sa=N&tab=wi&ei=sNEfUf-fHvLx0wG7uoG4DQ"

page = bs4.BeautifulSoup(urllib.request.urlopen(urlname)

But whenever I try to get the HTML from the page object, I get:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I test the URLs that are generated by pasting them into my web browser, and the browser doesn't return this error. What's going on?

score 1 · Answer 1 · answered Apr 16 '13 at 05:21

I am pretty sure that google is telling you: "Please don't do this". See this explanation of the http 403 error.

What is going on is that your python script, or more specifically urllib is sending headers, telling google that this is some kind of plain request, which is not coming from a browser.

Google is doing that rightfully so, since otherwise many people would simply scrape their website and show the google results as their own.

There are two solutions that I can see so far.

1) Use the google custom search API. It supports image search and has a free quota of 100 queries per day - for more queries you will have to pay.

2) Tools like mechanize are misleading websites, by telling them that they are browsers, and not in fact scraping bots, by e.g. sending manipulated headers. Common issues here are that if your scraper is too greedy(too many requests in a short interval) google will permanently block your IP address...

Dmitriy Zub · Answer 2 · 2021-10-01T05:54:17.823

That is because there's no user-agent specified. The default requests user-agent is python-requests thus Google blocks a request because it knows that it's a bot and not a "real" user visit and user-agent fakes it.

To scrape Google Images, both thumbnail and full resolution URL you need to parse the date from the page source inside <script> tags:

# find all <script> tags:
soup.select('script')

# match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

# match desired images (full res size) via regex:

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_images_data_json)

# Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')

Code and full example in the online IDE that downloads Google Images as well:

import requests, lxml, re, json, shutil, urllib.request
from bs4 import BeautifulSoup
from py_random_words import RandomWords

random_word = RandomWords().get_word()

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": random_word,
    "tbm": "isch", 
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # this steps could be refactored to a more compact
    all_script_tags = soup.select('script')

    # # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    print('\nFull Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images

        print(f'Downloading {index} image...')
        
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')

    
get_images_data()

Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with regex to match and extract needed data from the page source, instead, you only need to iterate over structured JSON and get what you want, fast, and don't need to maintain it over time.

Code to integrate to achieve your goal:

import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch
from py_random_words import RandomWords


random_word = RandomWords().get_word()

def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": random_word,
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images

    for index, image in enumerate(results['images_results']):

        print(f'Downloading {index} image...')
        
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')


get_google_images()

P.S - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.

Disclaimer, I work for SerpApi.

BeautifulSoup and urllib not parsing google images page

2 Answers2