Rather than doing this in shell with regexps, you may have an easier time if you use something that can actually parse the HTML itself, like PHP's DOMDocument class.
If you're stuck using only shell and need to slurp image URLs, you may not be totally out of luck. Regular expressions are a poor fit for parsing HTML, because HTML is not a regular language. But you may still be able to get by if your input data is highly predictable. (There is no guarantee of this, because Google updates its products and services regularly and often without prior announcement.)
That said, in the output of the URL you provided in your question, each image URL seems to be embedded in an anchor that links to "/imgres?…". If we can parse those links, we can probably gather what we need from them. Within those links, image URLs appear to be preceded by &imgurl=, so let's scrape this.
#!/usr/bin/env bash

# Possibly violate Google's terms of service by lying about our user agent
agent="Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20100101 Firefox/12.0"

# Search URL
url="http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg"

# Fetch the page, put each link on its own line, then pick apart the imgres links
curl -A "$agent" -s "$url" \
  | awk '{gsub(/<a href=/,"\n")} 1' \
  | awk '
    /imgres/ {
      sub(/" class=rg_l >.*/, "");       # clean things up
      split("", getvars);                # forget fields from the previous link
      n = split($0, fields, "&");        # gather the "GET" fields
      for (i = 1; i <= n; i++) {
        split(fields[i], a, "=");        # split name=value pair
        getvars[a[1]] = a[2];            # store in array
      }
      if ("imgurl" in getvars)
        print getvars["imgurl"];         # print the result
    }
  '
I'm using two awk commands because ... well, I'm lazy, and that was the quickest way to generate lines in which I could easily find the "imgres" string. One could spend more time cleaning this up and making it more elegant, but the law of diminishing returns dictates that this is as far as I go with this one. :-)
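If you did want to tidy it up, here is a minimal sketch of a single-awk variant. It is not what I ran against Google: the function name extract_imgurls is made up for illustration, and it assumes only that each value follows the literal text imgurl= and ends at the next & or double quote. It reads HTML on stdin, so it can be tried on a saved page without hitting the network.

```shell
# Hypothetical helper (not part of the original script): read HTML on stdin
# and print the value of every imgurl= parameter, one per line.
# Assumption: each value ends at the next "&" or double quote.
extract_imgurls() {
  awk '
    {
      rest = $0
      # Walk through every "imgurl=" occurrence on the line
      while ((pos = index(rest, "imgurl=")) > 0) {
        rest = substr(rest, pos + 7)        # text following "imgurl="
        end = match(rest, /[&"]/)           # first terminator, 0 if none
        val = (end > 0) ? substr(rest, 1, end - 1) : rest
        print val
      }
    }
  '
}
```

It would slot into the pipeline in place of both awk commands, e.g. curl -A "$agent" -s "$url" | extract_imgurls. The trade-off is that it no longer checks that the parameter sits inside an /imgres link, so it is only as reliable as the assumption that imgurl= appears nowhere else in the page.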
This script returns a list of URLs that you could download easily using other shell tools. For example, if the script is called getimages, then:
./getimages | xargs -n 1 wget
Note that Google appears to be handing me only 83 results (not 1000) when I run this with the search URL you specified in your question. It's possible that this is just the first page that Google would generally hand out to a browser before "expanding" the page (using JavaScript) as you scroll near the bottom. The proper way to handle this would be to use Google's search API, per Pavan's answer, and to pay Google for their data if you're making more than 100 searches per day.