I'm working on a program that uses Beautiful Soup to scrape a website and then urllib to retrieve the images found there (using each image's direct URL). The website I'm scraping isn't the original host of the images, but it does link to the originals. The problem I've run into is that, for certain websites, retrieving www.example.com/images/foobar.jpg redirects me to the homepage www.example.com and produces an empty (0 KB) image. In fact, going to www.example.com/images/foobar.jpg directly redirects as well. Interestingly, on the website I'm scraping, the image shows up normally.

I've seen some examples on SO, but they all explain how to capture cookies, headers, and other similar data from websites while getting around the redirect, and I was unable to get them to work for me. Is there a way to prevent a redirect and get the image stored at www.example.com/images/foobar.jpg?

This is the block of code that saves the image:

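import os
from urllib import urlretrieve

...

# URL and ctr are assumed to be set in the omitted code above
for imData in imList:
    imurl = imData['imurl']
    # Build the output name: <base>_<two-digit counter><extension>
    fName = os.path.basename(URL)
    fName, ext = os.path.splitext(fName)
    fName += "_%02d" % (ctr,) + ext
    urlretrieve(imurl, fName)
    ctr += 1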

The code that handles all the scraping is too long to reasonably put here, but I have verified that imData['imurl'] holds the accurate URL for the image, for example http://upload.wikimedia.org/wikipedia/commons/9/95/Brown_Bear_cub_in_river_1.jpg. However, certain images redirect, like: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/brown-bear-in-dog-salmon-creek.jpg.

wbest
The actual code you are using would help. But try setting the `User-Agent` header (see this [thread](http://stackoverflow.com/questions/7933417/how-do-i-set-headers-using-pythons-urllib)). – alecxe Mar 14 '14 at 17:41

1 Answer

The website you are attempting to download the image from may have extra checks to limit screen scraping. A common one is a check on the Referer header, which you can try adding to the urllib2 request:

import urllib2
req = urllib2.Request('<img url>')
req.add_header('Referer', '<page url / domain>')
data = urllib2.urlopen(req).read()  # raw bytes of the response body
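
A fuller sketch of the same idea, written for Python 3's `urllib.request` (the successor of `urllib2`), might look like the following. The function names `build_image_request` and `save_image` and the default User-Agent string are assumptions made for illustration, and whether a Referer header is enough depends on the checks the site actually performs.

```python
import shutil
import urllib.request


def build_image_request(img_url, page_url, user_agent="Mozilla/5.0"):
    """Return a Request for img_url that carries browser-like headers."""
    req = urllib.request.Request(img_url)
    # Referer: the page that embeds or links to the image.
    req.add_header("Referer", page_url)
    # Placeholder User-Agent value; a real browser string is longer.
    req.add_header("User-Agent", user_agent)
    return req


def save_image(img_url, page_url, filename):
    """Download img_url to filename, sending page_url as the Referer."""
    req = build_image_request(img_url, page_url)
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        shutil.copyfileobj(resp, out)
```

In the loop from the question, this would take the place of `urlretrieve(imurl, fName)`, with the scraped page's URL passed as `page_url`.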

For example, the request my browser sent for an alpaca image from the website you referenced includes a Referer header:

Request URL:http://www.public-domain-image.com/cache/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos_w725_h544.jpg
Request Method:GET
....
Referer:http://www.public-domain-image.com/fauna-animals-public-domain-images-pictures/alpacas-and-llamas-public-domain-images-pictures/alpacas-animals-vicugna-pacos.jpg.html
User-Agent:Mozilla/5.0 
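
To notice when a download is bounced to the homepage instead of silently saving an empty file, one option is to refuse redirects and treat any 3xx response as a failure. The sketch below assumes Python 3's `urllib.request`; `NoRedirectHandler` and `fetch_without_redirect` are hypothetical names, and the header values are placeholders modelled on the request above.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to follow the redirect, so the
        # 3xx response surfaces as an HTTPError instead.
        return None


def fetch_without_redirect(img_url, referer, user_agent="Mozilla/5.0"):
    """Return the image bytes, or None if the server tried to redirect."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    req = urllib.request.Request(
        img_url, headers={"Referer": referer, "User-Agent": user_agent}
    )
    try:
        with opener.open(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return None  # redirected: the image was not served directly
        raise
```

A `None` result then signals that the request headers still do not satisfy the site, so nothing should be written to disk for that image.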
Ric