15

I did a search on Google Images:

http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg

The result is thousands of photos. I am looking for a shell script that will download the first n images, for example 1000 or 500.

How can I do this?

I guess I need some advanced regular expressions or something like that. I have tried many things, but to no avail; can someone help me, please?

Shahbaz
Lukap
  • You say you tried many things - such as? =) – J. Steen Jul 17 '12 at 14:16
  • such as using curl and wget in combination with the grep command... but it didn't give me the expected results. I put two days' effort into parsing and still have a lot of problems – Lukap Jul 17 '12 at 15:32
  • Not a shell script, but if you are still looking for a command-line tool then this may help: https://github.com/hardikvasa/google-images-download – hnvasa Apr 07 '18 at 06:59

10 Answers

18

update 4: PhantomJS is now obsolete. I made a new script, google-images.py, in Python using Selenium and headless Chrome. See here for more details: https://stackoverflow.com/a/61982397/218294

update 3: I fixed the script to work with phantomjs 2.x.

update 2: I modified the script to use phantomjs. It's harder to install, but at least it works again. http://sam.nipl.net/b/google-images http://sam.nipl.net/b/google-images.js

update 1: Unfortunately this no longer works. It seems Javascript and other magic is now required to find where the images are located. Here is a version of the script for yahoo image search: http://sam.nipl.net/code/nipl-tools/bin/yimg

original answer: I hacked something together for this. I normally write smaller tools and use them together, but you asked for one shell script, not three dozen. This is deliberately dense code.

http://sam.nipl.net/code/nipl-tools/bin/google-images

It seems to work very well so far. Please let me know if you can improve it, or suggest any better coding techniques (given that it's a shell script).

#!/bin/bash
[ $# = 0 ] && { prog=`basename "$0"`;
echo >&2 "usage: $prog query count parallel safe opts timeout tries agent1 agent2
e.g. : $prog ostrich
       $prog nipl 100 20 on isz:l,itp:clipart 5 10"; exit 2; }
query=$1 count=${2:-20} parallel=${3:-10} safe=$4 opts=$5 timeout=${6:-10} tries=${7:-2}
agent1=${8:-Mozilla/5.0} agent2=${9:-Googlebot-Image/1.0}
query_esc=`perl -e 'use URI::Escape; print uri_escape($ARGV[0]);' "$query"`
dir=`echo "$query_esc" | sed 's/%20/-/g'`; mkdir "$dir" || exit 2; cd "$dir"
url="http://www.google.com/search?tbm=isch&safe=$safe&tbs=$opts&q=$query_esc" procs=0
echo >.URL "$url" ; for A; do echo >>.args "$A"; done
htmlsplit() { tr '\n\r \t' ' ' | sed 's/</\n</g; s/>/>\n/g; s/\n *\n/\n/g; s/^ *\n//; s/ $//;'; }
for start in `seq 0 20 $[$count-1]`; do
wget -U"$agent1" -T"$timeout" --tries="$tries" -O- "$url&start=$start" | htmlsplit
done | perl -ne 'use HTML::Entities; /^<a .*?href="(.*?)"/ and print decode_entities($1), "\n";' | grep '/imgres?' |
perl -ne 'use URI::Escape; ($img, $ref) = map { uri_unescape($_) } /imgurl=(.*?)&imgrefurl=(.*?)&/;
$ext = $img; for ($ext) { s,.*[/.],,; s/[^a-z0-9].*//i; $_ ||= "img"; }
$save = sprintf("%04d.$ext", ++$i); print join("\t", $save, $img, $ref), "\n";' |
tee -a .images.tsv |
while IFS=$'\t' read -r save img ref; do
wget -U"$agent2" -T"$timeout" --tries="$tries" --referer="$ref" -O "$save" "$img" || rm "$save" &
procs=$[$procs + 1]; [ $procs = $parallel ] && { wait; procs=0; }
done ; wait

Features:

  • under 1500 bytes
  • explains usage, if run with no args
  • downloads full images in parallel
  • safe search option
  • image size, type, etc. opts string
  • timeout / retries options
  • impersonates googlebot to fetch all images
  • numbers image files
  • saves metadata
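For example, based on the usage text the script prints when run with no arguments, typical invocations would be:

./google-images ostrich
./google-images nipl 100 20 on isz:l,itp:clipart 5 10

The first downloads 20 ostrich images with all defaults; the second fetches 100 "nipl" images, 20 in parallel, with safe search on, large clipart only, a 5-second timeout and up to 10 tries per download.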

I'll post a modular version some time, to show that it can be done quite nicely with a set of shell scripts and simple tools.

Sam Watkins
  • thanks for your solution, but can you please give me something more modular or more readable? I do not know shell well enough to make this more modular myself. Thanks once again – Lukap Aug 14 '12 at 07:17
  • Ok, I'll see what I can do. But if I paste it in the answer it will be a long answer! Did you try running it? – Sam Watkins Aug 15 '12 at 11:16
  • Whoa, no offence, but such scripts look extremely fishy... :-| Especially given the context, it reminds me of an old story about a convoluted script expanding to `rm -rf /`. – hijarian Nov 14 '13 at 07:12
  • However, it works perfectly and has solved my own task without any problems. @SamWatkins you've done a great job here, thanks! – hijarian Nov 14 '13 at 07:52
  • The google image search is more difficult now. Here is a version of the script for yahoo image search: http://sam.nipl.net/code/nipl-tools/bin/yimg – Sam Watkins Jan 14 '14 at 14:51
  • @SamWatkins I used your script for quite a long time and I'm very thankful for your effort. But as you have commented yourself, it doesn't work anymore. Is there any chance you can update the script to make it work with the current Google Image Search? – JohnCand Feb 20 '14 at 19:51
  • I changed the script to use "phantomjs", a captive web browser. This makes it more difficult to install, but at least it works again. http://sam.nipl.net/b/google-images http://sam.nipl.net/b/google-images.js I fixed the old version of the script too, but it only works for a maximum of 100 images since it can't simulate scrolling the page. http://sam.nipl.net/b/google-images-old – Sam Watkins May 02 '14 at 03:55
  • @hijarian, Not fishy, it's awesome. Can you not read? Do you think it is going to spawn a botnet and fire off a few nukes from Russia to North Korea or something? LOL. – Nicholas Hamilton Jul 27 '14 at 07:22
  • Hi @SamWatkins, I am looking for a similar script to download a few thousand images from Google/Yahoo Images, but I have not understood how to use your magic script. Does it still work? – bit Dec 08 '14 at 10:38
  • Great script! I used the yimg script and it works very well! But I did not understand how to set the number of images to download – bit Dec 08 '14 at 11:56
  • @bit, I think in that script, the option to set the number of images to download (2nd argument) is not exact. It might download a few more images... in fact, I think it does it in multiples of 60! That could be fixed by inserting `head -n $count |` before the last while loop – Sam Watkins Dec 10 '14 at 00:26
  • @SamWatkins it seems to work! I changed the "60" in `for start in seq 0 60 $[$count-1]` to "5", and when I type `./yimg cat 1` it saves one image of a cat, but I did not understand whether the script searches for 5 images – bit Dec 11 '14 at 19:04
  • @bit, better change that back to 60 in the seq command, that's about the number of images on a page, and it might download images more than once if you do that. Just add the "head" command in there (I guess you did that). – Sam Watkins Dec 11 '14 at 23:40
  • @SamWatkins if I leave 60 in seq (and yes, I added `head -n $count |`) the script correctly saves only n images, but it still makes 60 `wget` calls, which becomes slow when run over many words – bit Dec 12 '14 at 17:12
  • @SamWatkins One last question: is there a way to save only one image format? I tried `./yimg cat 'as_filetype=png'` and `./yimg cat 'imgtype=png'` but it only saves jpg images – bit Dec 28 '14 at 17:02
  • @bit you could grep the images list just for the PNG images before downloading them: insert `grep $'\.png\t' |` before the last while loop. I'm not aware of any option in yahoo image search to return only PNG images, although I think google does have such an option. – Sam Watkins Jan 04 '15 at 13:23
  • @bit look at my Python solution here http://stackoverflow.com/a/28487500/2875380; I was able to download 100 high-resolution images using Python – rishabhr0y Aug 02 '16 at 09:52
  • @Lukap look at my Python solution here stackoverflow.com/a/28487500/2875380; I was able to download 100 high-resolution images using Python – rishabhr0y Aug 02 '16 at 09:54
6

I don't think you can achieve the entire task using regexes alone. There are three parts to this problem:

1. Extract the links of all the images. This can't be done with regexes; you need a web-based language for it. Google has APIs to do this programmatically. Check out here and here.

2. Assuming you succeeded in the first step with some web-based language, you can use the following regex, which uses lookarounds, to extract the exact image URL:

(?<=imgurl=).*?(?=&)

The above regex says: grab everything starting after imgurl= until you encounter the & symbol. See here for an example, where I took the URL of the first image of your search result and extracted the image URL.

How did I arrive at the above regex? By examining the links of the images found in the image search.
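As a quick way to test the regex from the shell: a minimal sketch, assuming GNU grep built with PCRE support (-P); the /imgres link here is made up for illustration:

# Hypothetical /imgres link, for illustration only
link='/imgres?imgurl=http://example.com/panda.jpg&imgrefurl=http://example.com/&h=768&w=1024'

# -o prints only the matched text; -P enables the lookarounds used above
echo "$link" | grep -oP '(?<=imgurl=).*?(?=&)'
# prints: http://example.com/panda.jpg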

3. Now that you've got the image URLs, use some web-based language/tool to download your images.

Pavan Manjunath
  • In my view, the "correct answer" to this question is to use the APIs and forget about trying to process the HTML. I'd put much more focus and guidance on that part of the answer! ;-) Scraping HTML is always far more complicated than it should be... especially in a shell script! – Ray Hayes Jul 17 '12 at 15:23
  • @RayHayes you can find my working solution using Python and BeautifulSoup; I was able to scrape 100 full-resolution images from the Google image search http://stackoverflow.com/questions/20716842/python-download-images-from-google-image-search/28487500#28487500 – rishabhr0y Aug 03 '16 at 14:47
  • @rishabhr0y commendable, but I stand by my assertion that scraping is a short-term solution, dependent on the remote party not changing something as simple as their layout. APIs are designed for a purpose; if that purpose is to return search results, then that's the best thing to use. – Ray Hayes Aug 03 '16 at 16:59
2

Rather than doing this in shell with regexps, you may have an easier time if you use something that can actually parse the HTML itself, like PHP's DOMDocument class.

If you're stuck using only shell and need to slurp image URLs, you may not be totally out of luck. Regular Expressions are inappropriate for parsing HTML, because HTML is not a regular language. But you may still be able to get by if your input data is highly predictable. (There is no guarantee of this, because Google updates their products and services regularly and often without prior announcement.)

That said, in the output of the URL you provided in your question, each image URL seems to be embedded in an anchor that links to /imgres?…. If we can parse those links, we can probably gather what we need from them. Within those links, image URLs appear to be preceded with &amp;imgurl=. So let's scrape this.

#!/usr/local/bin/bash

# Possibly violate Google's terms of service by lying about our user agent
agent="Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20100101 Firefox/12.0"

# Search URL
url="http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg"

curl -A "$agent" -s -D- "$url" \
 | awk '{gsub(/<a href=/,"\n")} 1' \
 | awk '
   /imgres/ {
     sub(/" class=rg_l >.*/, "");       # clean things up
     split($0, fields, "\&amp;");       # gather the "GET" fields
     for (n=1; n<=length(fields); n++) {
       split(fields[n], a, "=");        # split name=value pair
       getvars[a[1]]=a[2];              # store in array
     }
     print getvars["imgurl"];           # print the result
   }
 '

I'm using two awk commands because ... well, I'm lazy, and that was the quickest way to generate lines in which I could easily find the "imgres" string. One could spend more time cleaning it up and making it more elegant, but the law of diminishing returns dictates that this is as far as I go with this one. :-)

This script returns a list of URLs that you could download easily using other shell tools. For example, if the script is called getimages, then:

./getimages | xargs -n 1 wget

Note that Google appears to be handing me only 83 results (not 1000) when I run this with the search URL you specified in your question. It's possible that this is just the first page that Google would generally hand out to a browser before "expanding" the page (using JavaScript) when I get near the bottom. The proper way to handle this would be to use Google's search API, per Pavan's answer, and to PAY google for their data if you're making more than 100 searches per day.

ghoti
  • "awk: line 5: illegal reference to array fields", did you try the script ? did it worked for you ? cause it doesn't work for me :( – Lukap Jul 18 '12 at 08:11
  • Yes, it worked for me. Remember that there are different versions of awk. Perhaps yours doesn't let `length()` return the number of elements in an array. What kind of awk are you running? (Run `awk --version` for a hint.) If I can duplicate your error, I'll post an update fixing it. – ghoti Jul 18 '12 at 13:12
  • "$ awk --version awk: not an option: --version" , maybe I need to install something ? sudo apt-get install awk doesn't work – Lukap Jul 18 '12 at 17:01
  • how to install your version ?, maybe that is the simplest solution , if I can install your version of awk it would be great – Lukap Jul 18 '12 at 17:09
0

Rather than attempting to parse the HTML (which is very hard and likely to break), consider the APIs highlighted by @Pavan in his answer.

Additionally, consider using a tool that already does something similar. wget has a spider-like feature for following links (specifically for specified file types). See this answer to the Stack Overflow question 'How do I use wget to download all images into a single folder'.
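For reference, a minimal sketch of that wget approach, assuming the images are ordinary links on a page wget can crawl (the domain is a placeholder; this will not work on Google's JavaScript-driven results page):

# -nd: don't recreate the site's directory tree locally
# -r:  follow links recursively
# -P:  directory to save into
# -A:  keep only files with these extensions
wget -nd -r -P ./images -A jpg,jpeg,png,gif http://www.example.com/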

Regex is wonderfully useful, but I don't think it is appropriate in this context - remember the Regex mantra:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

Ray Hayes
0

Following up on Pavan Manjunath's answer: if you also want the height and width of the image, use

(?<=imgurl=)(?<imgurl>.*?)(?=&).*?(?<=h=)(?<height>.*?)(?=&).*?(?<=w=)(?<width>.*?)(?=&)

You obtain three named regex groups, imgurl, height, and width, containing that information.
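A minimal sketch of that regex in action, assuming Perl 5.10+ for named captures; the /imgres link is made up for illustration (and, as the comment below notes, the pattern assumes imgurl, h and w appear in that order):

# Hypothetical /imgres link, for illustration only
link='/imgres?imgurl=http://example.com/panda.jpg&imgrefurl=http://example.com/&h=768&w=1024&sz=100'

# %+ holds the named captures: imgurl, height, width
echo "$link" | perl -ne 'print "$+{imgurl} ($+{width}x$+{height})\n"
  if /(?<=imgurl=)(?<imgurl>.*?)(?=&).*?(?<=h=)(?<height>.*?)(?=&).*?(?<=w=)(?<width>.*?)(?=&)/;'
# prints: http://example.com/panda.jpg (1024x768)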

LeMoussel
  • You don't need any of those lookarounds: `imgurl=(?<imgurl>.*?)&.*?h=(?<height>.*?)&.*?w=(?<width>.*?)&`. (Pavan shouldn't have used them either.) Also, you're assuming those attributes always appear in the same order. – Alan Moore Aug 02 '16 at 02:02
0

I found an easier way to do this with this tool, and I can confirm that it works well as of this post.

Feature Requests to the developer:

  • Get a preview of the image(s) to verify that it's correct.
  • Allow input of multiple terms sequentially (i.e. batch processing).
Vijay
0

Python script to download full-resolution images from Google Image Search. Currently it downloads 100 images per query.

from bs4 import BeautifulSoup
import urllib2
import os
import json

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),"html.parser")


query = raw_input("query image")  # you can change the query for the image here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
print url
#add the directory for your image here
DIR="C:\\Users\\Rishabh\\Pictures\\"+query.split('+')[0]+"\\"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
soup = get_soup(url,header)


ActualImages=[]# contains the link for Large original images, type of  image
for a in soup.find_all("div",{"class":"rg_meta"}):
    meta = json.loads(a.text)  # parse the result metadata JSON once
    link, Type = meta["ou"], meta["ity"]  # original image URL and file type
    ActualImages.append((link, Type))

print  "there are total" , len(ActualImages),"images"


###print images
for i , (img , Type) in enumerate( ActualImages):
    try:
        req = urllib2.Request(img, headers=header)  # 'header' is already a dict of request headers
        raw_img = urllib2.urlopen(req).read()
        if not os.path.exists(DIR):
            os.mkdir(DIR)
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type)==0:
            f = open(DIR + image_type + "_"+ str(cntr)+".jpg", 'wb')
        else :
            f = open(DIR + image_type + "_"+ str(cntr)+"."+Type, 'wb')


        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : "+img
        print e

I am reposting my solution here; the original was posted as an answer to the following question: https://stackoverflow.com/a/28487500/2875380

rishabhr0y
0

How about using this library? google-images-download

For anyone still looking for a decent way to download hundreds of images, you can use this command-line code.
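A minimal sketch, assuming the googleimagesdownload CLI entry point and flags documented in that project's README:

# Assumes: pip install google_images_download
# The command name and flags are taken from the project's README
googleimagesdownload --keywords "panda" --limit 100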

0

I used this to download 1000 images and it worked 100% for me: atif93/google_image_downloader

After you download it, open a terminal and install Selenium:

$ pip install selenium --user

Then check your Python version:

$ python --version

If running Python 2.7, then to download 1000 images of pizza run:

$ python image_download_python2.py 'pizza' '1000'

If running Python 3, then to download 1000 images of pizza run:

$ python image_download_python3.py 'pizza' '1000'

The breakdown is:

python image_download_python2.py <query> <number of images>
python image_download_python3.py <query> <number of images>

query is the image name you're looking for and number of images is how many you want to download. In my example above, my query is pizza and I want 1000 images of it.

Lance Samaria
-1

There are other libraries on GitHub; this one looks quite good: https://github.com/Achillefs/google-cse

require 'open-uri'  # stdlib; needed so open(img) can fetch a URL

g = GoogleCSE.image_search('Ian Kilminster')  # run an image search via the gem
img = g.fetch.results.first.link              # URL of the first result
file = img.split('/').last                    # use the last path segment as the filename
File.open(file, 'wb') { |f| f.write(open(img).read) }
`open -a Preview #{file}`                     # macOS: open the saved image in Preview
johndpope