I'm trying to obtain images from Google Image Search for a specific query, but the page I download contains no pictures and redirects me to Google's original page. Here's my code:

import subprocess
import urllib

AGENT_ID   = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"

GOOGLE_URL = "https://www.google.com/images?source=hp&q={0}"

_myGooglePage = ""

def scrape(self, theQuery):
    # Fetch the results page with curl, following redirects (-L) and spoofing the user agent (-A)
    self._myGooglePage = subprocess.check_output(["curl", "-L", "-A", self.AGENT_ID, self.GOOGLE_URL.format(urllib.quote(theQuery))], stderr=subprocess.STDOUT)
    print self.GOOGLE_URL.format(urllib.quote(theQuery))
    print self._myGooglePage
    f = open('./../../googleimages.html', 'w')
    f.write(self._myGooglePage)
    f.close()

What am I doing wrong?

Thanks

slwr

5 Answers

This is the code in Python that I use to search and download images from Google, hope it helps:

import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson

# Define search term
searchTerm = "hello world"

# Replace spaces ' ' in search term for '%20' in order to comply with request
searchTerm = searchTerm.replace(' ','%20')


# Start FancyURLopener with defined version 
class MyOpener(FancyURLopener): 
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
myopener = MyOpener()

# Counter used to name the downloaded image files
count = 0

for i in range(0,10):
    # Notice that the start changes for each iteration in order to request a new set of images for each loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' + 'v=1.0&q='+searchTerm+'&start='+str(i*4)+'&userip=MyIP')
    print url
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate for each result and get unescaped url
    for myUrl in dataInfo:
        count = count + 1
        print myUrl['unescapedUrl']

        myopener.retrieve(myUrl['unescapedUrl'],str(count)+'.jpg')

    # Sleep for one second to prevent IP blocking from Google
    time.sleep(1)

You can also find very useful information here.

Jaime Ivan Cervantes
  • Is it possible to define the image type in the URL given to Google? – erogol Aug 09 '14 at 09:11
  • I haven't looked at this for a while, but check the latest Google API. I think the answer is yes: you can refine your search to ".png", ".jpg", and even the vector-based format ".svg". – Jaime Ivan Cervantes Aug 09 '14 at 17:41

Here's a short script I wrote that does the whole deed.

crizCraig
  • Hello, your script seems to be using PIL. Unfortunately I have HUGE problems installing PIL on this machine. Since I just need the images, without transforming them in any way, is there a way to get away without it? – Pietro Speroni Jul 08 '12 at 10:18
  • I'm not sure how to avoid PIL, but if you're on a Mac I highly recommend MacPorts; it simplifies package installation and will install PIL for you. – crizCraig Jul 09 '12 at 20:07

I'll give you a hint ... start here:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=JULIE%20NEWMAR

Where JULIE and NEWMAR are your search terms.

That will return the JSON data you need. Parse it using json.load or simplejson.load to get back a dict, then dive into it: first the responseData, then the results list, which contains the individual items whose url you will then want to download.

Though I don't suggest in any way doing automated scraping of Google, since their (deprecated) API for this specifically says not to.
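
For illustration, a minimal sketch of that parsing flow (Python 2, in the same style as the answer above; it assumes the deprecated API still responds, and the Referer value is an arbitrary placeholder):

import json
import urllib2

url = ('https://ajax.googleapis.com/ajax/services/search/images'
       '?v=1.0&q=JULIE%20NEWMAR')
request = urllib2.Request(url, None, {'Referer': 'example'})
response = urllib2.urlopen(request)

# Parse the JSON body, then dive in: responseData -> results -> unescapedUrl
results = json.load(response)
for item in results['responseData']['results']:
    print item['unescapedUrl']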

michaelfilms

I know this is an old question, but I'm going to answer it anyway: there is a much simpler way to go about doing this.

import json
import urllib.request

def google_image(x):
    # Build the query string, replacing spaces with '%20'
    search = '%20'.join(x.split())
    url = 'http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=%s&safe=off' % search
    search_results = urllib.request.urlopen(url)
    js = json.loads(search_results.read().decode())
    results = js['responseData']['results']
    # Walk the results; 'rest' ends up holding the last result's URL
    for i in results:
        rest = i['unescapedUrl']
    return rest

That's it.
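
A quick usage sketch (assuming the deprecated endpoint still responds; note that the loop above keeps only the last result's URL):

print(google_image('hello world'))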

riyoken

One of the best ways is to use icrawler. Check the answer linked below; it works for me.

https://stackoverflow.com/a/51204611/4198099
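
For reference, a minimal icrawler sketch along the lines of the linked answer (the keyword and output directory are placeholders):

from icrawler.builtin import GoogleImageCrawler

# Download up to 10 images matching the keyword into ./images
crawler = GoogleImageCrawler(storage={'root_dir': 'images'})
crawler.crawl(keyword='hello world', max_num=10)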

Ravi Hirani