3

I was following this past question (Extracting image src based on attribute with BeautifulSoup) to try to extract all the images from a google images page. I was getting a "urllib2.HTTPError: HTTP Error 403: Forbidden" error but was able to get past it using this:

req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})

However, I then got a new error that seems to be telling me that the src attribute does not exist:

Traceback (most recent call last):
  File "Desktop/webscrapev2.py", line 13, in <module>
    print(tag['src'])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 'src'

I was able to get past that error by checking specifically for the 'src' attribute, but most of my images, when extracted, don't have a src attribute. It seems like Google is doing something to obscure my ability to extract even a few images (I know requests are limited, but I thought the limit was at least 10).

For example printing out the variable tag (see code below) gives me this:

 <img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f" jsaction="load:str.tbn" name="jxlMHbZd-duNgM:" onload="google.aft&amp;&amp;google.aft(this)"/>

But printing out the variable v gives 'None'. I have no idea why this is happening or how to get the actual image from what is returned. Does anyone know how to get the actual images? I'm especially worried since the data-src URL starts with encrypted... Should I query data-src to get the image instead of src? Any assistance or advice would be much appreciated!

Here is my full code (in Python):

from bs4 import BeautifulSoup
import urllib2

url = "https://www.google.com/search? q=baseball+pitcher&espv=2&biw=980&bih=627&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5h8-9lfjLAhUE7mMKHdgKD0YQ_AUIBigB"
#'http://www.imdb.com/title/tt%s/' % (id,)

req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})

soup = BeautifulSoup(urllib2.urlopen(req).read(), "lxml")
print "before FOR"
for tag in soup.findAll('img'):
    print "inside FOR"
    v = tag.get('src', tag.get('dfr-src'))  # gets "src", else "dfr-src"; None if both are missing
    print v
    print tag
    if v is None:
        continue
        print("v is NONE")
    print(tag['src'])

3 Answers

9

Oh, boy. You picked the wrong site to scrape from. :)

Google's Defenses

First off, Google is (obviously) Google. It knows web crawlers and web scrapers very well - its entire business is founded on them.

So it knows all the tricks that ordinary people get up to and, more importantly, has a strong incentive to make sure that nobody except end users gets their hands on its images.

Didn't pass a User-Agent header? Now Google knows you're a scraper bot that doesn't bother pretending to be a browser, and forbids you from accessing its content. That's why you got a 403: Forbidden error the first time - the server realised you were a bot and prevented you from accessing material. It's the simplest technique to block automated bots.
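As a rough illustration of the header trick described above, here is a minimal Python 3 sketch that attaches a browser-like User-Agent to a request. The helper name build_request and the constant BROWSER_UA are made up for this example; the User-Agent string is the one from the question. Nothing is fetched here, and sending the header is no guarantee that Google will serve the page.

```python
import urllib.request

# User-Agent string copied from the question; any current browser string would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) "
    "Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-like User-Agent header.

    build_request is a hypothetical helper name used only in this sketch.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.google.com/search?q=baseball+pitcher&tbm=isch")
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would perform the actual fetch.
```

Without the headers argument, urllib identifies itself with its own default User-Agent, which is what gives the scraper away.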

Google Builds Pages through Javascript

Don't have Javascript parsing capability (which Python requests, urllib and its ilk don't)? Now you can't view half your images because the way Google Image search results works (if you inspect the Network tab in your Chrome console as Google Images is loading) is that a few bundled requests are made to various content providers that then systematically add a src attribute to a placeholder img tag through inline obfuscated Javascript code.

At the very beginning of time, all of your images are essentially blank, with just a custom data-src attribute to coordinate activities. Requests are made to image source providers as soon as the browser begins to parse Javascript (because Google probably makes use of its own CDN, these images are transferred to your computer very quickly), and then page Javascript does the arduous task of chunking the received data, identifying which img placeholder it should go to and then updating src appropriately. These are all time-intensive operations, and I won't even pretend to know how Google can make them happen so fast (although note that messing with network throttling operations in Dev Tools on Chrome 48 can cause Google Images to hang, for some bizarre reason, so there's probably some major network-level code-fu going on over there).
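For the placeholder tags described above, a scraper without JavaScript can still read whichever attribute happens to be present. The sketch below prefers src and falls back to data-src; it uses only the standard library so it runs without extra packages, and the names ImgUrlCollector and collect_img_urls are invented for the example. The sample tag is a shortened copy of the one printed in the question.

```python
from html.parser import HTMLParser


class ImgUrlCollector(HTMLParser):
    """Collect one URL per <img>, preferring src and falling back to data-src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag != "img":
            return
        attributes = dict(attrs)
        url = attributes.get("src") or attributes.get("data-src")
        if url:
            self.urls.append(url)


def collect_img_urls(html):
    """Return the image URLs found in an HTML string (hypothetical helper)."""
    parser = ImgUrlCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = (
    '<img alt="Image result for baseball pitcher" class="rg_i" '
    'data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:'
    'ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f"/>'
)
print(collect_img_urls(sample))
```

With BeautifulSoup the same fallback is `tag.get('src') or tag.get('data-src')` inside the loop over `soup.find_all('img')`. Tags that have neither attribute are simply skipped.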

These thumbnail URLs begin with https://encrypted-tbn0.gstatic.com/..., which isn't something to worry about. As far as I can tell, "encrypted" is just part of the hostname of the Google server that hosts the thumbnails over HTTPS; it does not mean the image data needs any extra decryption on your side, and the URL in data-src can be requested like any other image URL.

(Note: all the above comes from poking around in Chrome Dev Tools for a bit and spending time with de-obfuscators. I am not affiliated with Google, and my understanding is most likely incomplete or even woefully wrong.)

Without a bundled Javascript interpreter, it is safe to say that Google Images is effectively a blank wall.

Google's Final Dirty Trick

But now say you use a scraper that is capable of parsing and executing Javascript to update the page HTML - something like a headless browser (here's a list of such browsers). Can you still expect to be able to get all the images just by visiting the src?

Not exactly. Google Images embeds images in its result pages.

In other words, it does not link to other pages, it copies the images in textual format and literally writes down the image in base64 encoding. This reduces the number of connections needed significantly and improves page loading time.

You can see this for yourself if you navigate to Google Images, right click on any image, and hit Inspect element. Here's a typical HTML tag for an image on Google Images:

<img data-sz="f" name="m4qsOrXytYY2xM:" class="rg_i" alt="Image result for google images" jsaction="load:str.tbn" onload="google.aft&amp;&amp;google.aft(this)" src="data:image/jpeg;base64,..." style="width: 167px; height: 167px; margin-left: -7px; margin-right: -6px; margin-top: 0px;">

Note the massive wall of text buried in src. That is quite literally the image itself, written in base 64. When we see an image on our screen, we are actually viewing the result of this very text parsed and rendered by a suitable graphics engine. Modern browsers support decoding and rendering of base64-encoded URIs, so it's not a surprise you can literally copy-paste the relevant text into your address bar, hit Enter and view the image at once.

To get the image back, you can decode this wall of text (after suitably parsing it to remove the data:image/jpeg;base64,) using the base64 module in Python:

import base64
base64_string = ...  # that monster you saw above, minus the "data:image/jpeg;base64," prefix
decoded_string = base64.b64decode(base64_string)  # raw image bytes

You must also parse the image type from the start of the src attribute, write decoded_string (which is raw bytes) to a file opened in binary mode, and save it with the file extension that matches that image type. Phew.
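The steps above (strip the prefix, decode, pick an extension, write the bytes) can be sketched as follows. The function names decode_data_uri and save_data_uri and the EXTENSIONS table are assumptions made for this example, and the demo URI is built from placeholder bytes rather than a real image.

```python
import base64

# Hypothetical lookup table; extend it for other image types as needed.
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "image/gif": ".gif"}


def decode_data_uri(data_uri):
    """Split 'data:<mime>;base64,<payload>' into (mime type, decoded bytes)."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def save_data_uri(data_uri, stem):
    """Write the decoded image to '<stem><ext>' and return the file name."""
    mime_type, raw = decode_data_uri(data_uri)
    filename = stem + EXTENSIONS.get(mime_type, ".bin")
    with open(filename, "wb") as handle:  # binary mode: raw is bytes, not text
        handle.write(raw)
    return filename


# Placeholder bytes stand in for a real image so the example is self-contained.
demo_uri = "data:image/gif;base64," + base64.b64encode(b"GIF89a-demo").decode("ascii")
mime, data = decode_data_uri(demo_uri)
print(mime, len(data))
```

In a real scraper the src attribute of each img tag would be passed to save_data_uri, after checking that it starts with "data:" rather than "http".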

tl;dr

Don't go after Google Images as your first major scraping project. It's

  • hard. Wikipedia is much easier to get ahold of.

  • in violation of their Terms of Service (although what scraping isn't? and note I am not a lawyer and this doesn't constitute legal advice) where they explicitly say

    Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.

  • really impossible to predict how to improve on. I wouldn't be surprised if Google was using additional authentication mechanisms even after spoofing a human browser as much as possible (for instance, a custom HTTP header), and no one except an anonymous rebellious Google employee eager to reduce his/her master to rubble (unlikely) could help you out then.

  • significantly easier to use Google's provided Custom Search API, which lets you simply ask Google for a set of images programmatically without the hassle of scraping. This API is rate-limited to about a hundred requests a day, which is more than enough for a hobby project. Here are some instructions on how to use it for images. As a rule, use an API before considering scraping.
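To give an idea of what the API route in the last point looks like, here is a small sketch that only builds the request URL for an image search. It assumes the Custom Search JSON API endpoint and its key, cx, q, searchType and num parameters; the key and engine ID below are placeholders, and the official documentation should be checked for current quotas and parameters.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed here; verify against the docs).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(query, api_key, engine_id, count=10):
    """Build an image-search URL; build_image_search_url is a hypothetical helper."""
    params = {
        "key": api_key,         # API key from the Google developer console
        "cx": engine_id,        # ID of your custom search engine
        "q": query,
        "searchType": "image",  # restrict results to images
        "num": count,           # number of results per request
    }
    return API_ENDPOINT + "?" + urlencode(params)


# "YOUR_API_KEY" and "YOUR_ENGINE_ID" are placeholders, not working credentials.
print(build_image_search_url("baseball pitcher", "YOUR_API_KEY", "YOUR_ENGINE_ID", count=5))
```

Fetching that URL returns JSON rather than HTML, so there are no tags to parse and no JavaScript to execute.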

Akshat Mahajan
  • Even though Google embeds images into its pages, it also links to the original source. Just tried it this week and it works. It's a pain to scrape G, but you can use a headless browser to load the page and then use the HTML from the headless browser in BeautifulSoup to parse the results. I also created a new Firefox Profile with an "Image blocker" plugin to reduce load on my network connection. – Raghavendra Kumar Nov 10 '17 at 09:07
  • Excellent answer. If I had read this, I wouldn't have wasted a full week trying to scrape Google Images :D Also, is there any way to download images from Google for free by sending a search query? – Chandraprakash Jul 23 '20 at 10:42
  • While webscraping for Google Shopping, I came across src contents as -  .. and the same text is present for every src attribute across the page. How should I make sense of this? – Pallavi Sep 08 '20 at 12:37
  • @Pallavi: What you are looking at is a [data URI](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs). Long story short, it's a GIF encoded in base64. Simply copy the section as is after `base64,`, decode from base64 to binary on disk, and feed it to an appropriate GIF viewer. – Akshat Mahajan Sep 08 '20 at 17:16
  • @AkshatMahajan with above data (after base64), I am only able to get 1x1 size image. When I checked other attributes of img, I found data-deferred="1". And then I found the real image data stored in – Pallavi Sep 08 '20 at 17:33
  • 1
    @Pallavi Grab all the content in ` – Akshat Mahajan Sep 08 '20 at 23:38
  • @Pallavi use this function to parse image from google ```def parse_img(id): img_elem=y.find("img", id) image=re.search("\(function\(\){var s='(.*)\';var i=\['"+img_elem['id']+"'];",wholehtmllstring) try: img=image[1] except: img=img_elem['src'] return str(img).replace("\\x3d","")``` – Wisdomrider Apr 22 '21 at 10:41
1

To scrape Google Images using requests and beautifulsoup libraries, you need to parse data from the page code, inside <script> tags using regular expressions.

If you only need to parse thumbnail size images, you can do it by passing content-type (solution found from MendelG) query params into HTTP request:

import requests
from bs4 import BeautifulSoup

params = {
    "q": "batman wallpaper",
    "tbm": "isch", 
    "content-type": "image/png",
}

html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')

for img in soup.select("img"):
  print(img["src"])

To scrape the full-res image URL with requests and beautifulsoup you need to scrape data from the page source code via regex.

Find all <script> tags:

all_script_tags = soup.select('script')

Match images data via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
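To see what that pattern captures, here is a self-contained run on a made-up page fragment. The sample markup and the helper name extract_callback_payloads are invented for the example; real Google pages contain far larger payloads.

```python
import re

# Same pattern as above: capture everything between "AF_initDataCallback(" and ");".
# [^<]+ cannot cross a "<", so each match stays inside its own <script> element.
PATTERN = re.compile(r"AF_initDataCallback\(([^<]+)\);")


def extract_callback_payloads(page_source):
    """Return the argument text of every AF_initDataCallback(...) call."""
    return PATTERN.findall(page_source)


# Invented sample markup, only to show the shape of the match.
sample = (
    "<script>AF_initDataCallback({key: 'ds:0', data:[]});</script>"
    "<script>AF_initDataCallback({key: 'ds:1', data:[1,2]});</script>"
)
for payload in extract_callback_payloads(sample):
    print(payload)
```

Each payload is JavaScript object syntax rather than strict JSON (the keys are unquoted), which is why the later steps search it with regular expressions instead of parsing it.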

Match desired images (full res size) via regex:

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# json.loads() on the raw text throws an error:
# "Expecting property name enclosed in double quotes"
# json.dumps() followed by json.loads() returns the same string, so the data
# stays plain text and is searched with regex below.
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_images_data_json)

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
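Why two passes? In the page source the URLs can be escaped twice, so one unicode-escape decode only peels off the outer layer. The sketch below shows this on an invented URL; decode_escapes is a hypothetical helper name, and the number of layers in real pages may differ.

```python
def decode_escapes(text, passes=2):
    """Apply the unicode-escape decoding step the given number of times."""
    for _ in range(passes):
        text = bytes(text, "ascii").decode("unicode-escape")
    return text


# Invented example: the "=" sign is stored as a doubly escaped \u003d.
raw = r"https://example.com/photo.jpg?w\\u003d500"

print(decode_escapes(raw, passes=1))  # one layer removed, \u003d still visible
print(decode_escapes(raw, passes=2))  # fully decoded URL
```

Note that bytes(text, 'ascii') raises UnicodeEncodeError if the matched text contains non-ASCII characters, so a real scraper may want to catch that.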

If you need to save them, you can do it via urllib.request.urlretrieve(url, filename) (more in-depth):

import urllib.request

# often times it will throw 404 error, to avoid it we need to pass user-agent

opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg') # you can skip folder path and it will save them in current working directory

Code and full example in the online IDE:

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup


headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch", 
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # json.loads() on the raw text throws an error:
    # "Expecting property name enclosed in double quotes"
    # json.dumps() followed by json.loads() returns the same string, so the data
    # stays plain text and is searched with regex below.
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images

        # print(f'Downloading {index} image...')
        
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')


get_images_data()


-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...

Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with regex to match and extract needed data from the source code of the page, instead, you only need to iterate over structured JSON and get what you want.

Code to integrate:

import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images

    for index, image in enumerate(results['images_results']):

        # print(f'Downloading {index} image...')
        
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')


get_google_images()

---------------
'''
[
...
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''

P.S - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce chance of being blocked while web scraping search engines.

Disclaimer, I work for SerpApi.

Dmitriy Zub
0

The best way to solve this problem is to use a headless browser, such as Chrome driven through ChromeDriver, together with a browser automation library like Selenium for Python. Beautiful Soup alone isn't adequate.
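A minimal sketch of that approach, assuming Selenium 4 and a local Chrome installation: the browser executes the page's JavaScript, and the rendered HTML can then be handed to Beautiful Soup. The function names build_search_url and fetch_rendered_html are invented for this example, and the q and tbm=isch parameters are taken from the other answer.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build a Google Images search URL (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "isch"})


def fetch_rendered_html(url):
    """Load a page in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


print(build_search_url("baseball pitcher"))
# html = fetch_rendered_html(build_search_url("baseball pitcher"))
# soup = BeautifulSoup(html, "lxml")  # then parse the img tags as usual
```

The fetch itself is left commented out because it needs a browser and network access; results may also be loaded lazily as the page scrolls, which this sketch does not handle.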