
I'm trying to crawl image URLs from Google with Python.

When I inspect the Google image search results with the browser's developer tools, I can see about 100 image URLs, and more URLs appear as I scroll down. That part is fine.

The problem is that my script only gets 20 URLs. I saved the response of my request to an HTML file and confirmed that it contains only 20 image URLs, so I think only 20 are output because only 20 are present in the response.

How do I get all the image URLs?

This is my source code:

#-*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

if __name__ == "__main__":
    print("Crawling!!!!!!!!!!!!!!!")

    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0)', 
           'referer' : 'http:google.com',
           'Accept': 'text/html',
           'Accept':'application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept': 'none',
           'Connection': 'keep-alive'}

    inputSearch = "sites:pinterest+white+jeans"
    req = urllib.request.Request("https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=" + inputSearch, headers = hdr)
    data = urllib.request.urlopen(req).read()

    bs = BeautifulSoup(data, "html.parser")

    for img in bs.find_all('img'):
        print(img.get('src'))
안진환
  • The link `https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=` seems to be incorrect to me. What exactly is your query? Can you please specify? – warl0ck Apr 06 '17 at 12:43

2 Answers


Your link is wrong. You can use the following code and see if it fits your needs.

You just have to pass a `searchTerm` and the function will fetch the Google results page and return the URLs of 20 images.

Code:

def get_images_links(searchTerm):

    import requests
    from bs4 import BeautifulSoup

    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    imgs_urls = []
    for img in soup.find_all('img'):
        # some <img> tags have no src attribute, so use .get() to avoid a KeyError
        src = img.get('src', '')
        if src.startswith("http"):
            imgs_urls.append(src)

    return imgs_urls

Usage:

get_images_links('computer')

Output:

['https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeq5kKIsOg6zSM2bSrWEnYhpZEpmOYiiLzqf6qfwKzSVUoZ5rHoya75DM',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTBUesIhyt4CgASIUDruqvvMzUBFCuG_iV92NXjZPMtPE5v2G626bge0g0',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRYz8c6LUAiyuAsXkMrOH8DC56aFEMy63m8Fw8-ZdutB5EDpw1hl0y3xg',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT33QNycX0Ghqhfqs7Masrk9uvp6d66VlD2djHFfqL4P6phZCJLxkSx0wnt',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRUF11cLRzH2WNfiUJ3WeAOm7Veme0_GLfwoOCs3R5GTQDfcFHMgsNQlyo',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTxcTcv4NPTboVorbD4I-uJbYjY4KjAR5JaMvUXCg33CLDUqop8IufKNw',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTU8MkWwhDgcobqn_H2N3SS7dPVwu3I-ki1Sa_4u5YOEt-rAfOk1Kb2jpHO',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQlGu_Y_dhu60UNyilmIUSuOjX5_UnmcWr2AXGJ0w6BmvCXUZissCrtPcw',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQN7ItGvBHD1H9EMBC0ZFDMzNu5nt2L-EK1CKmQE4gRNtylalyTTJQxalY',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQyFgwD4Wr20OImzk9Uc0gGGI2-7mYQAU6mJn2GEFkpgLTAqUQUm4KL0TUQwQ',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQR0LFRaUGIadOO5_qolg9ZWegXW0OTghzBf1YzoIhpqkaiY1f3YNx4JnE',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRuOk4nPPPaUdjnZl1pEwGwlfq25GjvZFsshmouB0QaV925KxHg43wJFWc6',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR5aqLfB9SaFBALzp4Z2qToLeWqeUjqaS3EwNhi6faHRCxYCPMsjhmivKf8',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR6deLi7H9DCaxJXJyP7lmoixad5Rgo1gBLfVQ35lEWrvpgPoyQJ8CcZ-4',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSPQAfl2WB-AwziLan6NAzvzh2xVDu_XJEcjqSGOdnOJdffo7goWhrFd3wU',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSB3o5cP8DMk9GqT9wpB1N7q6JtREUwitghlXO65UD5s3xCoLj80QuDlzw',
 'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ18lWMvzZcIZvKI36BUUpnBIaa5e4A3TUAVdxAs6hhJ-rod446dMrPph2V',
 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8XZhvomXcafQehhetM1_ZXOufBvWmEDAbOsqX-fiU5Xu3U6uWAO3XW-M',
 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQiWudrcl9y0XbtC19abcPfSwO4N060ipv4znqxnpLYWX5UFO-QdzJatd0r',
 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQtgqDxef3AOsiyUk0J0MbXgZT8c0JsAW3UpoumSTMFSGXde3BETrGSqw']

Edit:

If you want to get more than 20 URLs, you must either find a way to send the AJAX request that fetches the rest of the page, or use selenium to simulate the interaction between you and the webpage.

I've used the second approach (there are probably tons of other ways to do this, and you can optimize this code a lot if you want):

Code2:

def scrape_all_imgs_google(searchTerm):

    from selenium import webdriver
    from bs4 import BeautifulSoup
    from time import sleep

    def scroll_page():
        for i in range(7):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def click_button():
        more_imgs_button_xpath = '//*[@id="smb"]'
        driver.find_element_by_xpath(more_imgs_button_xpath).click()

    def create_soup():
        html_source = driver.page_source
        return BeautifulSoup(html_source, 'html.parser')

    def find_imgs(soup):
        imgs_urls = []
        for img in soup.find_all('img'):
            try:
                if img['src'].startswith('http'):
                    imgs_urls.append(img['src'])
            except KeyError:
                # skip <img> tags without a src attribute
                pass
        return imgs_urls

    #create webdriver
    driver = webdriver.Chrome()

    #define url using search term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)

    #get url
    driver.get(searchUrl)

    #the "more images" button may appear before or after scrolling,
    #so try clicking first and fall back to scrolling first
    try:
        click_button()
        scroll_page()
    except:
        scroll_page()
        click_button()

    #create soup only after all imgs loaded while we scrolled the page down
    soup = create_soup()

    #find img urls in soup
    imgs_urls = find_imgs(soup)

    #close driver
    driver.close()

    #return list of all img urls found in page
    return imgs_urls

Usage:

urls = scrape_all_imgs_google('computer')

print(len(urls))
print(urls)

Output:

377
['https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcT5Hi9cdE5JPyGl6G3oYfre7uHEie6zM-8q3zQOek0VLqQucGZCwwKGgfoE', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR0tu_xIYB__PVvdH0HKvPd5n1K-0GVbm5PDr1Br9XTyJxC4ORU5e8BVIiF', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQqHh6ZR6k-7izTfCLFK09Md19xJZAaHbBafCej6S30pkmTOfTFkhhs-Ksn', and etc...

If you don't want to use this code, you can take a look at Google Scraper and see if it has any method that can be useful for you.

dot.Py
  • Thank you, but I want to get more than 20 URLs. What should I do? – 안진환 Apr 07 '17 at 05:33
  • @안진환 I've updated my answer. Take a look at the new function: `scrape_all_imgs_google(searchTerm)`. – dot.Py Apr 07 '17 at 12:18
  • @안진환 I'm glad I was able to help you! And welcome to StackOverflow. If my answer solved your problem you can [mark it as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) to close your question. – dot.Py Apr 10 '17 at 09:40
  • Thank you, I found out how to close my question. I want to ask one last question: the "more images" button is not pressed after scrolling down, and there seems to be no exception handling for that. How do I set a condition to press the button once the page can no longer scroll down? – 안진환 Apr 10 '17 at 10:59
  • `it seem to be no exception handling`: wrong... take a look at the `try/except` clause right after the `driver.get(searchUrl)` statement. It'll try to push the button; if it doesn't locate the element, then it'll scroll down again and try to press it. You can tweak it a little bit to fit your needs. For example, you can raise the number `7` in `for i in range(7):` inside the `scroll_page()` function to make it scroll a few more times – dot.Py Apr 10 '17 at 11:08
  • Oh, I solved it. The button was pressed and then it finished; I added a `scroll_page()` call to the try statement – 안진환 Apr 10 '17 at 12:05
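
For the stopping condition asked about in the comments above, a common pattern is to keep scrolling until `document.body.scrollHeight` stops changing and only then click the button. A minimal, untested sketch (the helper name `scroll_until_stable` is made up for illustration; it assumes the same Chrome `driver` as in the answer):

from time import sleep

def scroll_until_stable(driver, pause=3):
    # hypothetical helper: scroll down repeatedly until the page height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded, so it is safe to click the button now
        last_height = new_height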

To get more than 20 image links you can use the `ijn` URL param, e.g. `ijn=0` returns 100 images, `ijn=1` returns 200 images, and so on, and you can achieve this without using selenium (see the sketch below).
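
A minimal sketch of requesting several pages this way (the query, headers, and params mirror the full example further down; each response would then be parsed with the regex steps described below):

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

pages = []
for page_index in range(3):  # ijn=0 -> ~100 results, ijn=1 -> ~200, and so on
    params = {"q": "pexels cat", "tbm": "isch", "hl": "en", "ijn": str(page_index)}
    html = requests.get("https://www.google.com/search", params=params, headers=headers)
    pages.append(html.text)  # parse each page with the regex steps below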

To scrape the full-res image URLs with requests and beautifulsoup, you need to extract the data from the page source code via regex.

Find all <script> tags:

soup.select('script')

Match images data in the <script> tags via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

Match desired images (full res size in this case) via regex from JSON string:

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error: "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",',
                                       matched_images_data_json)

# ...

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_google_image_data)

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')

Code to scrape and download full-res images and full example in the online IDE:

import requests, lxml, re, json
from bs4 import BeautifulSoup


headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch", 
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into a more compact form
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)


get_images_data()


-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...

Downloading Google Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with regex to match and extract the needed data from the source code of the page; instead, you only need to iterate over structured JSON and get what you want (see the short sketch after the output below).

Code to integrate:

import os, json # json for pretty output
from serpapi import GoogleSearch


def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))


get_google_images()

---------------
'''
[
...
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''
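
For instance, if you only need the direct links, you can iterate over the structured results instead of dumping the whole JSON. A small sketch, assuming `results` from `search.get_dict()` as in the code above (the `title` and `original` field names are taken from the output shown):

for image in results["images_results"]:
    # each result is a plain dict with the fields shown in the output above
    print(image["title"], image["original"])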

P.S. - I wrote a more in-depth blog post about how to scrape Google Images and how to reduce the chance of being blocked while web scraping search engines.

Disclaimer, I work for SerpApi.

Dmitriy Zub