
I would like to download bulk images, using Google image search.

My first method, downloading the page source to a file and then opening it with open(), works fine, but I would like to be able to fetch image URLs just by running the script and changing the keyword.

First method: Go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it to an html file. When I then open() that html file with the script, it works as expected and I get a neat list of all the URLs of the images on the search page. This is what line 6 of the script does (uncomment to test).

If, however, I use requests.get() to fetch the page, as shown in line 7 of the script, I get a different html document that does not contain the full URLs of the images, so I cannot extract them.

Please help me extract the correct urls of the images.

Edit: link to the tower.html I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0

This is the code I have written so far:

import requests
from bs4 import BeautifulSoup

# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# top line is using the attached "tower.html" as source, bottom line is using the url. The html file contains the source of the above url.
#page = open('tower.html', 'r').read()
page = requests.get(url).text

# parse the text as html
soup = BeautifulSoup(page, 'html.parser')

# iterate over all "a" elements
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only links that are strings and contain "imgurl" (other links on the page are not interesting)
    if type(link) == str and 'imgurl' in link:
        # print the part of the link between "=" and "&", which is the actual URL of the image
        print(link.split('=')[1].split('&')[0])
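As a side note: instead of splitting on "=" and "&", the standard library's urllib.parse can pull the imgurl parameter out more robustly. A minimal sketch, using a made-up /imgres link for illustration:

```python
from urllib.parse import urlparse, parse_qs

# hypothetical example of an "/imgres?..." link as found on the saved results page
link = '/imgres?imgurl=http://example.com/tower.jpg&imgrefurl=http://example.com/page'

# parse_qs handles URL-decoding and parameter order for us
query = parse_qs(urlparse(link).query)
print(query['imgurl'][0])  # http://example.com/tower.jpg
```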

  • Possibly related: https://stackoverflow.com/a/22871658/1291371, https://stackoverflow.com/a/35810041/1291371, https://stackoverflow.com/a/28487500/1291371 – ilyazub Mar 27 '20 at 18:55

3 Answers


You can extract Google Images data using regular expressions: the data you need renders dynamically, but it can be found in the inline JSON. This is a faster method than browser automation.

To verify that, search for the first image title in the page source (Ctrl+U); if it appears inside the <script> elements, it is most likely inline JSON, and we can extract the data from there.

To find the original images, we first need to find the thumbnails. After that, we remove the matched thumbnails from the parsed inline JSON, which makes it easier to parse the original-resolution images:

# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")

thumbnails = [
    bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    for thumbnail in matched_google_images_thumbnails
]

# remove previously matched thumbnails for easier full-resolution image matches
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "",
    str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(
    r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

Unfortunately, this method cannot find absolutely all the pictures, since they are added to the page while scrolling. If you need to collect absolutely all of them, you have to use browser automation such as selenium or playwright, unless you want to reverse engineer the requests.

There's a "ijn" URL parameter that defines the page number to get (greater than or equal to 0). It used in combination with pagination token that also located in the Inline JSON.

Check code in online IDE.

import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}

google_images = []

params = {
    "q": "tower",              # search query
    "tbm": "isch",             # image results
    "hl": "en",                # language of the search
    "gl": "us"                 # country where the search comes from
}

html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

all_script_tags = soup.select("script")

# https://regex101.com/r/RPIbXK/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

# https://regex101.com/r/NRKEmV/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")

thumbnails = [
    bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    for thumbnail in matched_google_images_thumbnails
]

# remove previously matched thumbnails for easier full-resolution image matches
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "",
    str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(
    r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    for img in matched_google_full_resolution_images
]
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })

print(json.dumps(google_images, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Eiffel Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Eiffel_Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTsuYzf9os1Qb1ssPO6fWn-5Jm6ASDXAxUFYG6eJfvmehywH-tJEXDW0t7XLR3-i8cNd-0&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/85/Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg/640px-Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg"
  },
  {
    "title": "tower | architecture | Britannica",
    "link": "https://www.britannica.com/technology/tower",
    "source": "Encyclopedia Britannica",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8EsWofNiFTe6alwRlwXVR64RdWTG2fuBQ0z1FX4tg3HbL7Mxxvz6GnG1rGZQA8glVNA4&usqp=CAU",
    "original": "https://cdn.britannica.com/51/94351-050-86B70FE1/Leaning-Tower-of-Pisa-Italy.jpg"
  },
  {
    "title": "Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3L9LA0VamqmevhCtkrHZvM9MlBf9EjtTT7KhyzRP3zi3BmuCOmn0QFQG42xFfWljcsho&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/3/3e/Tokyo_Sky_Tree_2012.JPG"
  },
  # ...
]

Alternatively, you can use the Google Images API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create and maintain the parser yourself.

Simple code example:

from serpapi import GoogleSearch
import os, json

image_results = []

# search query parameters
params = {
    "engine": "google",               # search engine: Google, Bing, Yahoo, Naver, Baidu...
    "q": "tower",                     # search query
    "tbm": "isch",                    # image results
    "num": "100",                     # number of images per page
    "ijn": 0,                         # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY")   # your serpapi api key
                                      # other query parameters: hl (lang), gl (country), etc.
}

search = GoogleSearch(params)         # where data extraction happens

images_is_present = True
while images_is_present:
    results = search.get_dict()       # JSON -> Python dictionary

    # checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                image_results.append(image["original"])

        # update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
        print(results["error"])

print(json.dumps(image_results, indent=2))

Output:

[
  "https://cdn.rt.emap.com/wp-content/uploads/sites/4/2022/08/10084135/shutterstock-woods-bagot-rough-site-for-leadenhall-tower.jpg",
  "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/1c/60/ff/c5/ambuluwawa-tower-is-the.jpg?w=1200&h=-1&s=1",
  "https://cdn11.bigcommerce.com/s-bf3bb/product_images/uploaded_images/find-your-nearest-cell-tower-in-five-minutes-or-less.jpeg",
  "https://s3.amazonaws.com/reuniontower/Reunion-Tower-Exterior-Skyline.jpg",
  "https://assets2.rockpapershotgun.com/minecraft-avengers-tower.jpg/BROK/resize/1920x1920%3E/format/jpg/quality/80/minecraft-avengers-tower.jpg",
  "https://images.adsttc.com/media/images/52ab/5834/e8e4/4e0f/3700/002e/large_jpg/PERTAMINA_1_Tower_from_Roundabout.jpg?1386960835",
  "https://awoiaf.westeros.org/images/7/78/The_tower_of_joy_by_henning.jpg",
  "https://eu-assets.simpleview-europe.com/plymouth2016/imageresizer/?image=%2Fdmsimgs%2Fsmeatontower3_606363908.PNG&action=ProductDetailNew",
  # ...
]

There's a Scrape and download Google Images with Python blog post if you need a little bit more code explanation.

Denis Skopa

Just so you're aware:

# http://www.google.com/robots.txt

User-agent: *
Disallow: /search



I would like to preface my answer by saying that Google relies heavily on scripting. It's very possible that you're getting different results because the page you're requesting via requests doesn't execute the scripts supplied on the page, whereas loading the page in a web browser does.

Here's what I get when I request the URL you supplied:

The text I get back from requests.get(url).text doesn't contain 'imgurl' in it anywhere. Your script is looking for that as part of its criteria and it's not there.

I do however see a bunch of <img> tags with the src attribute set to an image URL. If that's what you're after, then try this script:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# page = open('tower.html', 'r').read()
page = requests.get(url).text

soup = BeautifulSoup(page, 'html.parser')

for raw_img in soup.find_all('img'):
    link = raw_img.get('src')
    if link:
        print(link)

Which returns the following results:

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s
ngoue
  • I have tried scraping using urllib, which mostly gives me "Forbidden" back, which I believe is because of the Disallow rule that you mention. urllib works fine for anything but Google Images. I am aware that there are no "imgurl"s in the text that requests fetches. The results that you get are the thumbnails of the images. It is better than nothing, but I would like to harvest the full-resolution images. The problem is that the fetched page never contains them. Is there any way to make requests follow the scripts and actually fetch the address of the source image? – Mikkel Bue Tellus Feb 16 '16 at 18:22
  • That's exactly why it gives you the "Forbidden" back. They've got a whole module built to parse a website's robots.txt file and determine if scraping is allowed. You could try using the `re` library and use regular expressions to find values instead. However, I think it'll be difficult to scrape Google's search pages... They make it hard for a reason. – ngoue Feb 16 '16 at 18:24
  • Anyway, thanks for the edit that fetches the thumbnails :) – Mikkel Bue Tellus Feb 16 '16 at 18:31

You can find the image URLs via the 'data-src' or 'src' attribute of the <img> elements.

import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

REQUEST_HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

def get_images_new(self, prod_id, name, header, **kw):
    i = 1
    man_code = "apple"  # anything you want to search for
    url = "https://www.google.com.au/search?q=%s&source=lnms&tbm=isch" % man_code
    response = urlopen(Request(url, headers=REQUEST_HEADER))
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    image_elements = soup.find_all("img", {"class": "rg_i Q4LuWd"})
    for img in image_elements:
        # thumbnails are lazy-loaded, so the URL usually sits in 'data-src' rather than 'src'
        temp = img.get('data-src')
        if temp and i < 7:  # download at most 6 images
            image = temp
            filename = str(i)
            path = "/your/directory/" + str(prod_id)  # your target directory
            if not os.path.exists(path):
                os.mkdir(path)
            with open(path + "/" + filename + ".png", 'wb+') as imagefile:
                req = Request(image, headers=REQUEST_HEADER)
                resp = urlopen(req)
                imagefile.write(resp.read())
            i += 1

  • Please elaborate your answer further with more details. Your current answer *You can find the attributes by using 'data-src' or 'src' attribute.* is very generic and doesn't seem to actually answer OP's question / problem as given in the question's description. – Ivo Mori Sep 29 '20 at 00:36