I'm working on a program that uses Beautiful Soup to scrape a website, and then urllib to retrieve images found on the website (using the image's direct URL). The website I'm scraping isn't the original host of the image, but does link to the original image. The problem I've run into is that for certain websites retrieving www.example.com/images/foobar.jpg
redirects me to the homepage www.example.com
and produces an empty (0 KB) image. In fact, going to www.example.com/images/foobar.jpg
redirects as well. Interesting on the website I'm scraping, the image shows up normal.
I've seen some examples on SO, but they all explain how to capture cookies, headers, and other similar data from websites while getting around the redirect, and I was unable to get them to work for me. Is there a way to prevent a redirect and get the image stored at www.example.com/images/foobar.jpg
?
This is the block of code that saves the image:
from urllib import urlretrieve
...
for imData in imList:
imurl = imData['imurl']
fName = os.path.basename(URL)
fName,ext = os.path.splitext(fName)
fName += "_%02d"%(ctr,)+ext
urlretrieve(imurl,fName)
ctr += 1
The code that handles all the scraping is too long too reasonably put here. But I have verified that in imData['imurl'] holds the accurate url for the image, for example http://upload.wikimedia.org/wikipedia/commons/9/95/Brown_Bear_cub_in_river_1.jpg. However certain images redirect like: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/fauna-animals-public-domain-images-pictures/bears-public-domain-images-pictures/brown-bear-in-dog-salmon-creek.jpg.