Trying to search for images using Google Search, error 400

Question

I keep getting this error: urllib.error.HTTPError: HTTP Error 400: Bad Request

I believe it has something to do with the URLs, since when I put them in directly (replacing the {}) I receive the same error, but I don't know which URLs are correct. (Python 3.6, Anaconda)

import os
import urllib.request as ulib
from bs4 import BeautifulSoup as Soup
import json

url_a = 'https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q={}'
url_b = '\&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start={}'
url_c = '\&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg'
url_d = '\.i&ijn=1&asearch=ichunk&async=_id:rg_s,_pms:s'
url_base = ''.join((url_a, url_b, url_c, url_d))

headers = {'User-Agent': 'Chrome/69.0.3497.100'}

def get_links(search_name):
    search_name = search_name.replace(' ', '+')
    url = url_base.format(search_name, 0)
    request = ulib.Request(url, data=None, headers=headers)
    json_string = ulib.urlopen(request).read()
    page = json.loads(json_string)
    new_soup = Soup(page[1][1], 'lxml')
    images = new_soup.find_all('img')
    links = [image['src'] for image in images]
    return links

if __name__ == '__main__':
    search_name = 'Thumbs up'
    links = get_links(search_name)

    for link in links:
        print(link)

if you want a json data as response, read this https://developers.google.com/custom-search/v1/ — KC., Oct 25 '18 at 06:51

score 0 · Answer 1 · answered Oct 24 '18 at 14:25

I think your URL has a bunch of query parameters you don't need.

Try this simpler URL for image searching:

https://www.google.com/search?q={KEY_WORD}&tbm=isch

For example:

https://www.google.com/search?q=apples&tbm=isch
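A minimal sketch of building that simpler URL in Python, assuming you want the standard library to handle spaces and other special characters instead of a manual `.replace(' ', '+')`. The helper name `build_image_search_url` is made up for illustration; only `q` and `tbm=isch` come from the URL above.

```python
from urllib.parse import urlencode


def build_image_search_url(keyword):
    """Return an image-search URL carrying only the q and tbm parameters."""
    # urlencode() percent-encodes values and turns spaces into '+'.
    query = urlencode({"q": keyword, "tbm": "isch"})
    return "https://www.google.com/search?" + query


print(build_image_search_url("apples"))
print(build_image_search_url("Thumbs up"))
```

This only builds the URL; whether the response is parseable still depends on what Google returns for it.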

LeKhan9

1,300
1
5
15

It opens a web page, while in his example it `should` get json, but instead it gets strange html. – BladeMight Oct 24 '18 at 14:28
Ah gotcha, I thought OP wanted raw webpage – LeKhan9 Oct 24 '18 at 14:36

score 0 · Answer 2 · answered Oct 24 '18 at 14:26

I think the problem is in asearch=ichunk&async=_id:rg_s,_pms:s, which cannot be used with search. If I remove those parameters, the request works:

import os
import urllib.request as ulib
from bs4 import BeautifulSoup as Soup
import json

url_a = 'https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q={}'  # keep the {} so format() fills in the search term
url_b = '\&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start={}'
url_c = '\&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg'
url_d = '\.i&ijn=1'
url_base = ''.join((url_a, url_b, url_c, url_d))
print(url_base)  # show the assembled URL template for debugging

headers = {'User-Agent': 'Chrome/69.0.3497.100'}

def get_links(search_name):
    search_name = search_name.replace(' ', '+')
    url = url_base.format(search_name, 0)
    request = ulib.Request(url, data=None, headers=headers)
    json_string = ulib.urlopen(request).read()
    print(json_string)
    page = json.loads(json_string)
    new_soup = Soup(page[1][1], 'lxml')
    images = new_soup.find_all('img')
    links = [image['src'] for image in images]
    return links

if __name__ == '__main__':
    search_name = 'Thumbs up'
    links = get_links(search_name)

    for link in links:
        print(link)
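If you would rather strip the suspect parameters programmatically than edit the string by hand, something along these lines could work. It is a sketch using only `urllib.parse`; the helper name `drop_query_params` and the shortened example URL are assumptions for illustration, not part of the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, unwanted):
    """Return url with every query parameter named in `unwanted` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in unwanted
    ]
    # Re-encode the surviving pairs and put the URL back together.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened, hypothetical version of the question's URL.
original = ("https://www.google.com/search?q=a+mouse&tbm=isch&ijn=1"
            "&asearch=ichunk&async=_id:rg_s,_pms:s")
print(drop_query_params(original, {"asearch", "async"}))
```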

Dmitriy Zub · Answer 3 · 2021-08-26T16:13:10.180

I'm not really sure what you were trying to do by scraping JSON data with BeautifulSoup, since it is an HTML/XML parser and can't parse JSON. Instead, you can pull the contents of <script> tags that might contain JSON data with the re module, parse that string with json, and then iterate over the result.
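As a rough illustration of that idea, here is a self-contained sketch that pulls a JSON object out of a <script> tag with `re` and parses it with `json`. The HTML fragment, the variable name `pageData`, and the helper `extract_script_json` are all hypothetical; real Google pages embed their data differently, so the pattern would have to be adapted.

```python
import json
import re

# Hypothetical page fragment: a <script> tag that assigns a JSON object to a variable.
SAMPLE_HTML = """
<html><body>
<script>var pageData = {"images": [{"title": "apple", "url": "https://example.com/a.jpg"}]};</script>
</body></html>
"""


def extract_script_json(html, var_name):
    """Return the JSON object assigned to var_name inside a <script>, or None."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\});\s*</script>"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_script_json(SAMPLE_HTML, "pageData")
for image in data["images"]:
    print(image["title"], image["url"])
```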

Have a look at the requests library. The code becomes easier to read if you put only the query parameters you need (as LeKhan9 already mentioned) in, say, a params dict and then pass it to requests.get(), just like you did with headers:

params = {
    "q": "minecraft lasagna skin",
    "tbm": "isch",
    "ijn": "0", # batch of 100 images
}

requests.get(URL, params=params)

Code and full example in the online IDE, which also scrapes the suggested search results at the top (try to read it step by step, it's pretty straightforward):

import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "minecraft lasagna skin",
    "tbm": "isch",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

print('\nGoogle Images Metadata:')
for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
    title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
    source = google_image.select_one('.fxgdke').text
    link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
    print(f'{title}\n{source}\n{link}\n')

# these steps could be refactored into something more compact
all_script_tags = soup.select('script')

# https://regex101.com/r/48UZhY/4
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# calling json.loads() directly on the matched text throws an error:
# "Expecting property name enclosed in double quotes"
# note: json.dumps() followed by json.loads() returns the same string, so the regexes below run on raw text
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

# https://regex101.com/r/pdZOnW/3
matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

# https://regex101.com/r/NnRg27/1
matched_google_images_thumbnails = ', '.join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                str(matched_google_image_data))).split(', ')

print('Google Image Thumbnails:')  # in order
for fixed_google_image_thumbnail in matched_google_images_thumbnails:
    # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
    google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

    # after the first decoding pass, escaped Unicode characters are still present; the second pass decodes them.
    google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
    print(google_image_thumbnail)

# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    removed_matched_google_images_thumbnails)

print('\nGoogle Full Resolution Images:')  # in order
for fixed_full_res_image in matched_google_full_resolution_images:
    # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
    print(original_size_img)


----------------
'''
Google Images Metadata:
Lasagna Minecraft Skins | Planet Minecraft Community
planetminecraft.com
https://www.planetminecraft.com/skins/tag/lasagna/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSPttXb_7ClNBirfv2Beh4aOBjlc-7Jw_kY8pZ4DrkbAavZcJEtz8djo_9iqdnatiG6Krw&usqp=CAU
...
Google Full Resolution Images:
https://static.planetminecraft.com/files/resource_media/preview/skinLasagnaman_minecraft_skin-6204972.jpg
...
'''
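To see why the loops above decode each URL twice, here is a small standalone sketch. It assumes the input is doubly escaped (two backslashes before each `u003d`), which is the situation the two passes are meant to handle; the sample URL and the helper name `decode_twice` are made up for illustration.

```python
def decode_twice(escaped):
    """Apply the unicode-escape codec twice and return both intermediate results."""
    once = bytes(escaped, "ascii").decode("unicode-escape")
    twice = bytes(once, "ascii").decode("unicode-escape")
    return once, twice


# Raw string: each escape sequence is preceded by two literal backslashes.
sample = r"https://example.com/images?q\\u003dtbn:abc\\u0026usqp\\u003dCAU"
once, twice = decode_twice(sample)
print(once)   # still contains single-backslash \u003d sequences
print(twice)  # escapes resolved to '=' and '&'
```

The first pass only collapses each double backslash into a single one; the second pass turns the remaining escape sequences into the actual characters.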

Alternatively, you can achieve this using the Google Images API from SerpApi. It's a paid API with a free plan.

The biggest difference is that you only need to iterate over structured JSON that has already been parsed, without having to figure out why something isn't parsing properly. Check out the playground.

Code to integrate:

import os, json # json for pretty output
from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "minecraft shaders 8k photo",
  "tbm": "isch"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

-----------
# same output as above but in JSON format
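Iterating over the returned dictionary could then look roughly like the sketch below. It runs on a hand-written stand-in for `results` so no API key is needed; only the `images_results` key is taken from the code above, while the per-image keys `title` and `original` and the helper `collect_image_urls` are assumptions to check against the actual response.

```python
# Hypothetical stand-in for the dict returned by search.get_dict();
# only the "images_results" key comes from the answer, the rest is assumed.
sample_results = {
    "images_results": [
        {"title": "First image", "original": "https://example.com/1.jpg"},
        {"title": "Second image", "original": "https://example.com/2.jpg"},
        {"title": "No full-size link"},
    ]
}


def collect_image_urls(results, url_key="original"):
    """Return the url_key value of every image result that has one."""
    urls = []
    for item in results.get("images_results", []):
        url = item.get(url_key)
        if url:  # skip entries where the assumed key is missing
            urls.append(url)
    return urls


print(collect_image_urls(sample_results))
```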

I wrote a blog post on how to scrape Google Images in a bit more detail.

Disclaimer: I work for SerpApi.
