I wrote a script to download images from Google Image search, which currently downloads 100 original images per query.
I originally posted the script as a Stack Overflow answer to
Python - Download Images from google Image search?
Here I will explain in detail how I scrape the URLs of the original images from Google Image search using urllib2 and BeautifulSoup.
For example, if you want to scrape images of the movie Terminator 3 from Google Image search:
query= "Terminator 3"
query= '+'.join(query.split()) #this will make the query terminator+3
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
req = urllib2.Request(url,headers=header)
soup= urllib2.urlopen(req)
soup = BeautifulSoup(soup)
The variable soup above now contains the HTML of the requested page. Next we need to extract the image URLs. To do that, open the page in your browser and use Inspect Element on one of the images;
there you will find the tags containing the URL of the image.
For Google Image search I found that the div tags with class "rg_meta" contain the link to the image.
You can look up the details in the BeautifulSoup documentation.
print soup.find_all("div",{"class":"rg_meta"})
You will get a list of results like this:
<div class="rg_meta">{"cl":3,"cr":3,"ct":12,"id":"C0s-rtOZqcJOvM:","isu":"emuparadise.me","itg":false,"ity":"jpg","oh":540,"ou":"http://199.101.98.242/media/images/66433-Terminator_3_The_Redemption-1.jpg","ow":960,"pt":"Terminator 3 The Redemption ISO \\u0026lt; GCN ISOs | Emuparadise","rid":"VJSwsesuO1s1UM","ru":"http://www.emuparadise.me/Nintendo_Gamecube_ISOs/Terminator_3_The_Redemption/66433","s":"Screenshot Thumbnail / Media File 1 for Terminator 3 The Redemption","th":168,"tu":"https://encrypted-tbn2.gstatic.com/images?q\\u003dtbn:ANd9GcRs8dp-ojc4BmP1PONsXlvscfIl58k9hpu6aWlGV_WwJ33A26jaIw","tw":300}</div>
The result above contains the link to our image:
http://199.101.98.242/media/images/66433-Terminator_3_The_Redemption-1.jpg
You can extract these links and download the images as follows:
ActualImages = []  # will contain (link to the large original image, image type) pairs
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)
    link, Type = meta["ou"], meta["ity"]
    ActualImages.append((link, Type))

DIR = "Pictures/"   # output folder (note the trailing slash); change it to wherever you want the images
image_type = query  # label used in the saved file names

for i, (img, Type) in enumerate(ActualImages):
    try:
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        if not os.path.exists(DIR):
            os.mkdir(DIR)
        cntr = len([fname for fname in os.listdir(DIR) if image_type in fname]) + 1
        print cntr
        if len(Type) == 0:
            f = open(DIR + image_type + "_" + str(cntr) + ".jpg", 'wb')
        else:
            f = open(DIR + image_type + "_" + str(cntr) + "." + Type, 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e
Voila! Now you can use this script to download images from Google Image search, or to collect training images.
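For example, if you want to collect training images for several classes, you could wrap the steps above into a small helper function. This is only a minimal sketch under the same assumptions as the script above (Python 2, urllib2, BeautifulSoup, and the rg_meta divs); the name download_images, its parameters, and the folder layout are my own placeholders, not part of the original script.

import os
import json
import urllib2
from bs4 import BeautifulSoup

def download_images(query, out_dir, limit=20):
    # download_images is a hypothetical helper wrapping the steps explained above
    q = '+'.join(query.split())
    url = "https://www.google.co.in/search?q=" + q + "&source=lnms&tbm=isch"
    header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')

    if not os.path.exists(out_dir):
        os.makedirs(out_dir)

    saved = 0
    for div in soup.find_all("div", {"class": "rg_meta"}):
        meta = json.loads(div.text)
        link, ext = meta["ou"], meta["ity"] or "jpg"  # fall back to .jpg when no type is given
        try:
            raw_img = urllib2.urlopen(urllib2.Request(link, headers=header)).read()
        except Exception as e:
            print "could not load : " + link
            print e
            continue
        fname = os.path.join(out_dir, query.replace(" ", "_") + "_" + str(saved) + "." + ext)
        with open(fname, 'wb') as f:
            f.write(raw_img)
        saved += 1
        if saved >= limit:
            break
    return saved

# e.g. collect a small training set for two classes
for label in ["Terminator 3", "Terminator 2"]:
    print download_images(label, os.path.join("training_data", label.replace(" ", "_")))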
You can get the fully working script here:
https://gist.github.com/rishabhsixfeet/8ff479de9d19549d5c2d8bfc14af9b88