
So I'm trying to write a small crawler that picks a few image links from a Google image search and downloads them. It doesn't need to run 1000 times a day with 1000 queries; it's just a simple script to download the first ten or so images for a given search term.

For that I have the following code:

import requests
from bs4 import BeautifulSoup
import json
import urllib.request

s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"})

URL = "https://www.google.dk/search"

def get_images(query, start):
    images = []

    screen_width = 1920
    screen_height = 1080
    params = {
        "q": query,
        "sa": "X",
        "biw": screen_width,
        "bih": screen_height,
        "tbm": "isch",        # image search
        "ijn": start // 100,  # results page; integer division so Python 3 doesn't produce a float
        "start": start,
        #"ei": "" - This seems like a unique ID; you might want to use it to avoid getting banned. But you probably still are.
    }

    request = s.get(URL, params=params)
    bs = BeautifulSoup(request.text, "lxml")

    # Each rg_meta div holds a JSON blob; 'ou' is the original image URL.
    for img in bs.find_all("div", {"class": "rg_meta"}):
        js = json.loads(img.text)
        images.append(js['ou'])

    return images

So basically I get a list of links that I can then loop over and download, with the following code, which also numbers the images consecutively from 1 up to however many are crawled:

searchlist = ["cats"]  # search strings
nr_img = 5  # number of images to be crawled

for k, searchstring in enumerate(searchlist):
    images = get_images(searchstring, 0)

    img_nr_list = []
    for n, x in enumerate(images[0:nr_img]):
        n += 1 + k * nr_img  # number images consecutively across search terms
        # Raw string for the path: "\f" and "\b" would otherwise be escape characters.
        urllib.request.urlretrieve(x, r"\foo\bar\{}.jpg".format(n))
        img_nr_list.append("{}.jpg".format(n))

In principle, pretty straightforward. However, my problem is that some images are just thumbnails or have a low resolution. So my question is: is there a way to say something like "if width < 600px and height < 400px then skip", or something similar?
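One approach I could imagine, as a sketch: the same rg_meta JSON blobs the crawler already parses may also carry width/height fields for the original image (I'm assuming the key names 'ow'/'oh' here, which would need verifying against the live page), so filtering could happen before any download:

```python
def filter_by_size(meta_entries, min_w=600, min_h=400):
    """Keep the 'ou' URL of each entry whose original dimensions
    meet the minimum. Each entry is a parsed rg_meta dict; the
    'ow'/'oh' key names are an assumption to verify."""
    keep = []
    for js in meta_entries:
        if js.get("ow", 0) >= min_w and js.get("oh", 0) >= min_h:
            keep.append(js["ou"])
    return keep
```

Missing keys default to 0, so entries without size information are skipped rather than crashing the loop.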

Denver Dang

1 Answer


I don't know how to do it with BeautifulSoup, but there is another Python library called ImageScraper that lets you define a maximum image size:

https://pypi.python.org/pypi/ImageScraper

I only tested it using the command-line tool, as it's Python 2.7 and I'm normally on Python 3+.

Jim Factor
    This limits the image size in bytes, which might correlate with the width and height at a given DPI or image type, but it isn't a perfect correlation. Determining whether an image is likely above or below a given width/height before downloading would require heuristics and a classifier. – Alex Huszagh Aug 01 '17 at 01:23
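If downloading the bytes first is acceptable, the dimension check itself is simple: read the body into memory, let Pillow report the real pixel size, and only write to disk when it passes. A sketch (the 600x400 threshold mirrors the question; `fetch_if_large_enough` is a hypothetical helper name):

```python
from io import BytesIO

import requests
from PIL import Image  # Pillow

def is_large_enough(data, min_w=600, min_h=400):
    """Return True if the raw image bytes decode to at least min_w x min_h pixels."""
    with Image.open(BytesIO(data)) as img:
        return img.width >= min_w and img.height >= min_h

def fetch_if_large_enough(url, path, min_w=600, min_h=400):
    """Download an image, but only save it if its pixel dimensions pass the check."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    if not is_large_enough(resp.content, min_w, min_h):
        return False  # too small -- skip it
    with open(path, "wb") as f:
        f.write(resp.content)
    return True
```

This costs a full download per candidate image, so it trades bandwidth for accuracy; filtering on any size metadata in the search results first would cut down how often it runs.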