
I'm trying to scrape images from Google, but I keep getting duplicates. My script downloads about 200 images, but only 60 or so are unique. How do I get more unique images and eliminate the duplicates?

Here's my code:

import json
import os
import time
import requests
from PIL import Image
from StringIO import StringIO
from requests.exceptions import ConnectionError
import string
import urllib
import random

def go(query, path):
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)

    resultitem = 0
    file_save_dir = BASE_PATH
    filename_length = 10
    filename_charset = string.ascii_letters + string.digits
    ipaddress = '163.118.75.137'
    url = 'https://ajax.googleapis.com/ajax/services/search/images?'\
          'v=1.0&q=' + query + '&start=%d'

    while resultitem < 60:
        response = requests.get(url % resultitem)
        results = json.loads(response.text)
        for result in results['responseData']['results']:
            print result['unescapedUrl']
            filename = ''.join(random.choice(filename_charset)
                               for s in range(filename_length))
            urllib.urlretrieve(result['unescapedUrl'],
                               os.path.join(file_save_dir, filename + '.png'))
        resultitem = resultitem + 1  # or + 8? Duplicates?

def main():
    go('angry human face', 'myDirectory')

if __name__ == "__main__":
    main()

1 Answer

The problem is here:

filename = ''.join(random.choice(filename_charset)
                   for s in range(filename_length))

A randomly generated name is not guaranteed to be unique, so colliding names silently overwrite files you have already downloaded.

You should use the tempfile module instead.
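
For example, a minimal sketch using tempfile.mkstemp, assuming the same file_save_dir and result variables as in your loop:

import os
import tempfile
import urllib

# mkstemp creates the file immediately, so the name it returns is
# guaranteed to be unique within the directory
fd, filepath = tempfile.mkstemp(suffix='.png', dir=file_save_dir)
os.close(fd)  # we only need the path; urlretrieve reopens the file
urllib.urlretrieve(result['unescapedUrl'], filepath)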

Alternatively, since what you really care about is a unique file name, you can do this:

for idx, result in enumerate(results['responseData']['results']):
    print result['unescapedUrl']
    filename = "IMG%s" % idx

idx here will be a unique number for each URL within a single response. To keep names unique across requests as well, fold in the start offset, e.g. filename = "IMG%d" % (resultitem + idx).
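
Unique names alone won't reduce duplicate images, though: since your start offset advances by 1, consecutive requests return mostly the same results. If you also want to drop downloads whose contents are identical, you can hash each image and skip digests you have already seen. A rough sketch, with a hypothetical helper save_if_new:

import hashlib

seen = set()

def save_if_new(data, filepath):
    # Identical image bytes hash to the same digest, so exact
    # duplicates are detected and skipped
    digest = hashlib.md5(data).hexdigest()
    if digest in seen:
        return False  # duplicate content, don't write it
    seen.add(digest)
    with open(filepath, 'wb') as f:
        f.write(data)
    return True

Inside the loop you would fetch the bytes with requests.get(result['unescapedUrl']).content and pass them to save_if_new instead of calling urlretrieve. Also consider stepping resultitem by the number of results per response (the + 8 your comment hints at) rather than by 1, so each request asks for a fresh page.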
