
I'm trying to scrape some images from Google search results using Requests and Beautiful Soup. There is code on the net that uses urllib2 and works (about half the time for me), but I want to use Requests with Beautiful Soup, and I'm having trouble parsing the JSON portion. I'm interested in getting the 'ou' value, which is a link. I'm not sure what I'm doing wrong.

import requests
from bs4 import BeautifulSoup
import json

url =  'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'
payload = {'q': 'Blue Sky'}
response = requests.get(url, params = payload)
print (response.url)

images =[]
soup = BeautifulSoup(response.content, 'html.parser')
results2 =soup.find_all(("div",{"class":"rg_meta notranslate"}))
#checking results2, It seems I am indeed extracting the div portion. 


for re in results2:
    link, Type = json.loads((re.text))["ou"] , json.loads((re.text))["ity"]
    images.append(link)

This is how the div class looks:

<div class="rg_meta notranslate">
{"clt":"n",
"id":"tO9o23RfxP9tlM:",
 "isu":"myrabridgforth.com",
 "itg":0,
 "ity":"jpg",
 "oh":742,
 "ou":"http://myrabridgforth.com/wp-content/uploads/blue-   sky-clouds.jpg","ow":1268,"pt":"Myra Bridgforth, Counselor » Blog Archive Ten Ways to Use a Blue ...","rid":"jjIitG_NjwFNSM","rmt":0,"rt":0,"ru":"http://myrabridgforth.com/2015/06/ten-ways-to-use-a-blue-sky-hour-at-a-coffee-shop/","s":"Ten Ways to Use a Blue Sky Hour at a Coffee Shop","st":"Myra Bridgforth, Counselor","th":172,"tu":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcTLhBlZEL6ljsKInKzx1V4GX-lXeksntKy6B4UkmVrOB_2uNoTbcQ","tw":294}
</div>

Running the JSON line, I am ending up in this error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Here is how the first 15% or so of the results2 result set looks:

[<div id="gbar"><nobr><a class="gb1" href="https://www.google.com/search?tab=iw">Search</a> <b class="gb1">Images</b> <a class="gb1" href="https://maps.google.com/maps?hl=en&amp;tab=il">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=i8">Play</a> <a class="gb1" href="https://www.youtube.com/results?tab=i1">YouTube</a> <a class="gb1" href="https://news.google.com/nwshp?hl=en&amp;tab=in">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=im">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=io">Drive</a> <a class="gb1" href="https://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a></nobr></div>,
 <div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a> | <a class="gb4" href="/preferences?hl=en">Settings</a> | <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=https://www.google.com/search%3Fsite%3D%26tbm%3Disch%26source%3Dhp%26biw%3D1873%26bih%3D990%26q%3DBlue%2BSky" id="gb_70" target="_top">Sign in</a></nobr></div>,
 <div class="gbh" style="left:0"></div>,
 <div class="gbh" style="right:0"></div>,
 <div id="logocont"><h1><a href="/webhp?hl=en" id="logo" style="background:url(/images/nav_logo229.png) no-repeat 0 -41px;height:37px;width:95px;display:block" title="Go to Google Home"></a></h1></div>,
 <div class="lst-a"><table cellpadding="0" cellspacing="0"><tr><td class="lst-td" valign="bottom" width="555"><div style="position:relative;zoom:1"><input autocomplete="off" class="lst" id="sbhost" maxlength="2048" name="q" title="Search" type="text" value="Blue Sky"/></div></td></tr></table></div>,
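
A minimal sketch of why that error shows up, assuming (from the dump above) that the first element of results2 is the navigation-bar div rather than an rg_meta div. The extra pair of parentheses in find_all((...)) passes a single tuple, which appears to make Beautiful Soup match every div instead of filtering by class, so json.loads receives ordinary page text. The sample strings, the example.com URL, and the helper name parse_meta are illustrative only.

```python
import json

# Hypothetical stand-ins for the .text of two divs: the first mimics the
# navigation bar from the dump, the second a trimmed rg_meta payload.
div_texts = [
    "Search Images Maps Play YouTube News Gmail Drive More",
    '{"ity": "jpg", "ou": "http://example.com/blue-sky.jpg"}',
]

# Plain page text is not JSON, so json.loads fails on the very first character.
try:
    json.loads(div_texts[0])
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)


def parse_meta(text):
    """Return (link, type) if text is a JSON object with an 'ou' key, else None."""
    try:
        meta = json.loads(text)
    except json.JSONDecodeError:
        return None  # not JSON, e.g. ordinary page text
    if not isinstance(meta, dict) or "ou" not in meta:
        return None
    return meta["ou"], meta.get("ity")


# Keep only the links from the entries that parsed successfully.
images = [parsed[0] for parsed in map(parse_meta, div_texts) if parsed]
print(images)
```

Written as soup.find_all("div", {"class": "rg_meta notranslate"}), the call would filter by class, although the HTML served to Requests may not contain any such div at all.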

My code is based off rishabhr0y's code which seems to be having success (according to the comments) with Beautiful Soup and urllib2.

Python - Download Images from google Image search?

Moondra
  • *"I'm trying to scrape some images from Google search results"* I'm pretty sure that is against Google's TOS. – Tomalak Jul 06 '17 at 22:36
  • BeautifulSoup can't find any `div` tags with `rg_meta notranslate`. They're probably dynamically created. – Aran-Fey Jul 06 '17 at 22:42
  • I'm only going to scrape a few images as a test. I want to know why the urllib code works as opposed to requests. – Moondra Jul 06 '17 at 22:43
  • @Rawing People seem to be having success with rishabhr0y's code in this link, `https://stackoverflow.com/questions/20716842/python-download-images-from-google-image-search/28487500#28487500`, which is using a similar div tag. He is using urllib though. Not sure if urllib is making the difference. In my testing, It seemed to be working when I fully utilized his code. A few times when I was trying to deconstruct his code, it wasn't working. The code seems to have been reused a lot, and people seem to be having success. So I'm not sure what I'm doing wrong. – Moondra Jul 06 '17 at 22:53
  • Your result2 doesn't contain the div that you say it has. – cs95 Jul 06 '17 at 22:53
  • @cᴏʟᴅsᴘᴇᴇᴅ Thanks for the reply. I've updated the OP to show how result2 looks, when I print it. If the `div` is dynamic, I'm not sure how people are having success off of the code (rishabhr0y) in the link provided. – Moondra Jul 06 '17 at 23:02
  • @Moondra It isn't uncommon for Google to change the content they serve. It's quite possible that in the last 2 years, the structure of the html has changed. You'll just have to find another way. – cs95 Jul 06 '17 at 23:08
  • @cᴏʟᴅsᴘᴇᴇᴅ I see. Thank you. – Moondra Jul 06 '17 at 23:11

1 Answer


To get the full-resolution image URLs with requests and Beautiful Soup, you need to extract them from the inline JavaScript in the page source using regular expressions.

Find all <script> tags:

all_script_tags = soup.select('script')

Match images data via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
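
As a rough illustration of what this pattern captures, here is a sketch run against a made-up string standing in for str(all_script_tags). The sample markup is an assumption, not real Google output.

```python
import re

# Made-up stand-in for str(soup.select('script')); real pages are far larger.
all_script_tags = (
    "[<script>var unrelated = 1;</script>, "
    "<script>AF_initDataCallback({key: 'ds:1', data:[1,2]});</script>]"
)

# Capture everything between "AF_initDataCallback(" and the closing ");".
# [^<]+ cannot cross a "<", so a match never runs past the end of its script tag.
matched_images_data = ''.join(
    re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags))
)
print(matched_images_data)
```

Scripts without an AF_initDataCallback call contribute nothing, and the captures from all matching scripts are concatenated into one string.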

Match desired images (full res size) via regex from JSON string:

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_images_data)
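
A small sketch of how this pattern behaves, using an invented data fragment. It assumes, as the full script further down does, that thumbnail entries are stripped first; that leaves a doubled comma in front of each full-resolution entry, which is what the (?:'|,), prefix matches.

```python
import re

# Invented fragment shaped like one image entry: [thumbnail], [full-size image].
sample = (
    '[0,"id1",["https://encrypted-tbn0.gstatic.com/images?q=tbn:A",183,275],'
    '["https://example.com/full.jpg",800,1200],null]'
)

# Remove the thumbnail entry so it is not picked up as a full-size image.
without_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
    '', sample)

# Match ["<url>",<number>,<number>] entries preceded by "'," or ",,".
full_res = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                      without_thumbnails)
print(full_res)
```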

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
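
To see why the decode is applied twice, here is a sketch with an invented URL whose equals sign is escaped as \\u003d (two backslashes). That double-escaped form is an assumption about what this step receives; real data may be escaped differently.

```python
# Invented URL whose "=" is double-escaped: the raw string holds two backslashes.
escaped = r"https://example.com/images?q\\u003dtbn:abc"

# First pass collapses the doubled backslash into a single one.
once = bytes(escaped, 'ascii').decode('unicode-escape')
# Second pass turns the remaining \u003d escape into a literal "=".
twice = bytes(once, 'ascii').decode('unicode-escape')

print(once)
print(twice)
```

Note that bytes(..., 'ascii') raises UnicodeEncodeError if the URL contains non-ASCII characters, so such URLs would need separate handling.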

If you need to save them, you have two easy options: urllib.request.urlretrieve or requests.

To save images via urllib.request.urlretrieve(url, filename):

import urllib.request

# the download often fails with a 404 error unless a User-Agent header is sent

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

# the folder must already exist; omit it to save into the current working directory
urllib.request.urlretrieve(original_size_img, 'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg')

To save images via requests (code taken from this answer):

import requests

url = "YOUR_IMG.jpg"
response = requests.get(url)
if response.status_code == 200:
    with open("/YOUR/PATH/TO_IMAGE/sample_img.jpg", 'wb') as f:
        f.write(response.content)
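
Both snippets hard-code a .jpg file name, while the image URLs can end in other extensions. A possible sketch for deriving the name from the URL follows; the helper filename_from_url, its default extension, and the list of accepted extensions are assumptions for illustration, not part of either library.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, index, default_ext=".jpg"):
    """Build a local file name, reusing the URL's extension when it looks like an image."""
    path = urlparse(url).path               # drops the query string
    ext = os.path.splitext(path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        ext = default_ext                   # fall back when the URL has no usable extension
    return f"original_size_img_{index}{ext}"


print(filename_from_url(
    "https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&fm=jpg", 0))
print(filename_from_url("https://example.com/image", 1))
```

The returned name can be passed as the filename argument of urlretrieve or to open() in the requests variant.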

Code to scrape and download full-res images, and full example in the online IDE:

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup


headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch", 
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # calling json.loads() directly on the raw string throws an error:
    # "Expecting property name enclosed in double quotes"
    # Note: json.dumps() followed by json.loads() round-trips the string unchanged,
    # so matched_images_data_json is still a plain str that the regexes below search.
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    # ------------------------------------------------
    # Download original images

    # install the opener once so that urlretrieve sends a User-Agent header
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
    urllib.request.install_opener(opener)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # print(f'Downloading {index} image...')

        # the Images/ folder must already exist, otherwise urlretrieve raises FileNotFoundError
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')


get_images_data()


-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...

Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

Alternatively, you can achieve the same thing with the Google Images API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to write regular expressions to match and extract data from the page source. Instead, you only need to iterate over structured JSON and pick out what you want.
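
As a sketch of what "iterating over structured JSON" amounts to, the snippet below walks a hand-written dictionary shaped like the images_results output shown further down. The first entry is trimmed from that output; the second entry, which lacks an "original" key, is invented to show a defensive check.

```python
# Hand-written sample in the shape of results['images_results']; no API call is made.
results = {
    "images_results": [
        {
            "position": 100,
            "source": "pexels.com",
            "title": "Close-up of Cat · Free Stock Photo",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
        },
        # invented entry without a full-size link
        {"position": 101, "source": "example.com", "title": "No original here"},
    ]
}

# Collect full-size image URLs, skipping entries that do not provide one.
originals = [
    image["original"]
    for image in results["images_results"]
    if "original" in image
]
print(originals)
```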

Code to integrate:

import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    # print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images

    # create the target folder if it does not exist yet
    os.makedirs('SerpApi_Images', exist_ok=True)

    # install the opener once so that urlretrieve sends a User-Agent header
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
    urllib.request.install_opener(opener)

    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')


get_google_images()

---------------
'''
[
...
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''

P.S. I wrote a more in-depth blog post about how to scrape Google Images.

Disclaimer, I work for SerpApi.

Dmitriy Zub