
I am trying to find copyright-free images on Google, but I am unable to get the correct image URLs. My code applies the correct filter and directs me to the right page, yet it retrieves the URLs for images without the copyright-free and size filters applied, and I am unsure why. Thank you in advance.

import requests
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

url = 'https://google.com/search?q='
input = 'cat'
# tbm=isch --> image search
# tbs=isz:m --> medium size
# il:cl --> copyright-free (I think)
url = url+input+'&tbm=isch&tbs=isz:m%2Cil:cl'
print(url)
html = urlopen(Request(url, headers={'User-Agent': 'Google Chrome'}))

soup = BeautifulSoup(html.read(),'html.parser')

#using soup to find all img tags
results = soup.find_all('img')
str_result = str(results)

lst_result = str_result.split(',')
#trying to get the first link for the images with the appropriate settings
link = lst_result[4].split(' ')[4].split('"')[1]

# writing into the appropriate testing file, to be changed
file = open('.img1.png','wb')
get_img = requests.get(link)
file.write(get_img.content)
file.close()
enigma312
  • Are the urls that it returns from that same page? – joey Jul 29 '21 at 04:15
  • @joey no they are not – enigma312 Jul 29 '21 at 16:26
  • which page are they coming from? According to your code, they must come from `url+input+'&tbm=isch&tbs=isz:m%2Cil:cl'`... if I were you I would double check that page with inspect element. You will solve the problem once you know where the images are coming from – joey Jul 29 '21 at 16:50
  • @joey they are coming from the default page without the copyright-free filter but I do not understand why – enigma312 Jul 29 '21 at 19:02
  • Where did you get the url from? When I used advanced search I got `https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=cat&as_epq=&as_oq=&as_eq=&imgsz=&imgar=&imgc=&imgcolor=&imgtype=&cr=&as_sitesearch=&safe=images&as_filetype=&tbs=sur%3Acl` for cat and a Creative Commons license – joey Jul 29 '21 at 19:48
  • @joey yes, that is the link I was hoping the URL comes from, but instead it comes from [link](https://www.google.com/search?q=cat&sxsrf=ALeKk00GixyUkcM6QGTp1u-OPmv8Vm25Ew:1627603692185&source=lnms&tbm=isch&sa=X&ved=2ahUKEwj88qCfwInyAhVpKVkFHXR2BnoQ_AUoAXoECAEQAw&biw=1309&bih=707) – enigma312 Jul 30 '21 at 00:08
  • Does this answer your question? [Python - Download Images from google Image search?](https://stackoverflow.com/questions/20716842/python-download-images-from-google-image-search) – ilyazub Aug 25 '21 at 14:52

1 Answer


You can try a somewhat easier approach: instead of specifying the tbs=il:cl param and playing a guessing game about which images really are licensed under Creative Commons, search for "pexels cat" or "unsplash cat".

Or, you can add the filter param (tbs=il:cl) plus pexels/unsplash at the beginning of the query.

These images are free by default, since those websites are designed to provide free images for commercial or non-commercial use, and Google will then show results only from those websites.
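As a sketch of that approach, the URL can be assembled with `urllib.parse.urlencode` instead of string concatenation, which also takes care of percent-encoding the `tbs` value (the query and parameter names below mirror the ones used in this thread; how Google actually honors the filter server-side is not guaranteed):

```python
from urllib.parse import urlencode

# assemble the image-search URL with the license filter plus a free-stock
# site name prepended to the query
params = {
    "q": "pexels cat",     # "pexels"/"unsplash" prepended to the search term
    "tbm": "isch",         # image search
    "tbs": "il:cl,isz:m",  # license filter + medium size
}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```

Note that `urlencode` percent-encodes the `:` and `,` inside `tbs` (producing `tbs=il%3Acl%2Cisz%3Am`), so there is no need to hand-write `%2C` as in the question's code.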


To find and extract the original image URLs, you need to parse them out of the <script> tags via regex.

Firstly you need to find all script tags using bs4:

soup.select('script')

Secondly, to match a desired pattern using regex:

# one of the regex patterns to find original size URL
re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", SOME_VARIABLE)
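To see what that pattern matches, here it is run against a small hand-made string shaped like the `["url", width, height]` arrays in Google's inline JSON (the sample fragment is made up for illustration):

```python
import re

# one of the regex patterns to find the original-size URL
pattern = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"

# made-up fragment mimicking the ["url", width, height] arrays in the page source
sample = ',,["https://example.com/photos/cat.jpg",1200,800]'

matches = re.findall(pattern, sample)
print(matches)  # ['https://example.com/photos/cat.jpg']
```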

Thirdly, iterate over the matches, extract and decode each URL one by one:

# `matched_urls` stands for the list returned by re.findall() above
for fixed_full_res_image in matched_urls:
    # it needs to be decoded twice,
    # otherwise Unicode escape sequences are still present after the first decode
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
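To illustrate why two passes are needed: in the page source a character such as `=` can appear doubly escaped as `\\u003d`, so the first `unicode-escape` pass only collapses `\\` into `\`, and the second pass turns the remaining `\u003d` into `=`. A contrived example (the URL is made up):

```python
# the raw string contains a doubly-escaped '=' : backslash, backslash, u003d
raw = "https://example.com/img?size\\\\u003dlarge"

first_pass = bytes(raw, "ascii").decode("unicode-escape")
# first_pass still contains the single escape: ...size\u003dlarge

second_pass = bytes(first_pass, "ascii").decode("unicode-escape")
print(second_pass)  # https://example.com/img?size=large
```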

Full code example (it also scrapes titles, sources, and links):

import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "tbs": "il:cl",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decode, Unicode escape sequences are still present;
        # the second decode resolves them
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nGoogle Full Resolution Images:')  # in order
    for fixed_full_res_image in matched_google_full_resolution_images:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

get_images_data()

--------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSb48h3zks_bf6y7HnZGyGPn3s2TAHKKm_7kzxufi5nzbouJcQderHqoEoOZ4SpOuPDjfw&usqp=CAU
...

Google Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
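The extracted full-resolution URLs can then be written to disk, generalizing the single-file write from the question. A rough sketch (`save_images` and `pick_extension` are hypothetical helper names, and the download loop needs network access):

```python
import pathlib

def pick_extension(url: str) -> str:
    """Guess a file extension from the URL path; default to .jpg."""
    path = url.split("?", 1)[0].lower()
    for ext in (".png", ".gif", ".webp", ".jpeg", ".jpg"):
        if path.endswith(ext):
            return ext
    return ".jpg"

def save_images(urls, out_dir="images"):
    """Download each URL into out_dir, like the question's file-writing
    step but for every matched image."""
    import requests  # imported here so pick_extension stays dependency-free
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        target = pathlib.Path(out_dir) / f"img{i}{pick_extension(url)}"
        target.write_bytes(resp.content)
```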

Alternatively, you can skip this process by using Google Images API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to iterate over structured JSON since everything else is already done for the end-user.

Code to integrate:

import os, json # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "minecraft shaders 8k photo",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

----------
'''
...
  {
    "position": 60, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
    "source": "pexels.com",
    "title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
    "link": "https://www.pexels.com/search/videos/cats/",
    "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
...
'''
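Because the response is already structured, post-processing is just list/dict work. For example, filtering the results down to free-stock hosts (the `results` list below is a hand-made stand-in following the shape of the sample output above):

```python
# hand-made stand-in for search.get_dict()["images_results"]
results = [
    {"position": 60, "source": "pexels.com",
     "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg"},
    {"position": 61, "source": "random-blog.example",
     "original": "https://random-blog.example/cat.jpg"},
]

# keep only results hosted on free-stock sites
free_sites = ("pexels.com", "unsplash.com", "pixabay.com")
free_images = [r["original"] for r in results if r["source"] in free_sites]
print(free_images)  # ['https://images.pexels.com/videos/855282/free-video-855282.jpg']
```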

P.S. - I wrote a blog post about scraping Google Images that covers this in more depth, with visual examples.

Disclaimer, I work for SerpApi.

Dmitriy Zub