Similar to "Try to scrape image from image url (using python urllib ) but get html instead", but the solution there does not work for me.

import requests

img_url='http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'

r = requests.get(img_url, allow_redirects=False)

headers = {}
headers['Referer'] = r.headers['location']

r = requests.get(img_url, headers=headers)
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
    fh.write(r.content)

The downloaded file is still an HTML page, not an image.
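
One way to confirm what was actually saved is to look at the first bytes of the payload instead of trusting the file extension. The following is a minimal Python 3 sketch, assuming a JPEG is expected; the helper names `looks_like_jpeg` and `looks_like_html` are hypothetical and not part of any library.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files start with the three bytes FF D8 FF."""
    return data[:3] == b"\xff\xd8\xff"


def looks_like_html(data: bytes) -> bool:
    """Rough check: the payload starts with an HTML doctype or an <html> tag."""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

Running `looks_like_html(r.content)` before writing the file would flag the case described here, where the server answers with a page rather than the image.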

alec.tu
  • This website has a redirection mechanism: if you request the resource directly, it redirects you to the HTML page. So when your code requests the image, the server redirects to the HTML page and you get that HTML file, not the image file. – Deepak Sharma Sep 27 '16 at 04:27
  • so there is no solution for this web site? – alec.tu Sep 27 '16 at 04:28
  • Usually the solution is to replicate what your browser does. So fire up chrome, open the developer tools, switch to the network tab. Then load the page that hosts that image. What usually happens is there is some sort of cookie (or other HTTP artefact) created on the HTML page, that gets sent with the request for your image. So look at the request that the browser makes for the image, and see what headers and cookies are sent with it. Then look through the rest of the traffic to see where they came from. – GregHNZ Sep 27 '16 at 04:30
  • you want to save the image file locally? – Deepak Sharma Sep 27 '16 at 04:33
  • @DeepakSharma , yup. – alec.tu Sep 27 '16 at 04:35
  • Try http://stackoverflow.com/questions/8286352/how-to-save-an-image-locally-using-python-whose-url-address-i-already-know#answer-8286449. You may hit the same problem there too, but I just want you to check whether that approach works or also fails. – Deepak Sharma Sep 27 '16 at 04:35
  • @DeepakSharma, using urllib.urlretrieve still gets a html page, not an image. – alec.tu Sep 27 '16 at 04:37
  • @GregHNZ, it will open a new tab when I click the image, then the tool box of chrome is gone in the new tab. I can not capture any network resource... – alec.tu Sep 27 '16 at 04:39
  • Can you open the dev tools in the new window, then hit refresh? – GregHNZ Sep 27 '16 at 04:42
  • @GregHNZ, the network resource is still the `img_url`, but I can see some cookies. one is PHPSeedID (only in the session) and the other one is _ga (expire time is 2018 yr). Do I need to specify the cookie in the request header? – alec.tu Sep 27 '16 at 04:50
  • You probably need to code a request to the html page, parse the cookies etc from the response, and add them to your request for the image. – GregHNZ Sep 27 '16 at 04:54
  • @GregHNZ, I put both cookies in the request, but still got a html page. – alec.tu Sep 27 '16 at 05:04
  • Hi guys, I found a solution for this. See the answer below. – alec.tu Sep 27 '16 at 05:08
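
The suggestion in the comments (load the hosting page first, keep whatever cookies it sets, then request the image with a matching Referer) can be sketched roughly as follows. This is a Python 3 sketch that uses only the standard library instead of `requests`; the function names `make_opener`, `build_image_request` and `fetch_image` are hypothetical, and whether this site actually requires cookies is an assumption (the asker reports that adding cookies alone did not help).

```python
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_image_request(img_url, page_url):
    """Build a GET request for the image that names the hosting page as Referer."""
    return urllib.request.Request(img_url, headers={"Referer": page_url})


def fetch_image(opener, page_url, img_url):
    """Visit the hosting page, then download the image with the same opener."""
    # Load the hosting page first so any cookies it sets land in the jar.
    opener.open(page_url).close()
    # The opener attaches the stored cookies to this second request.
    with opener.open(build_image_request(img_url, page_url)) as resp:
        return resp.read()
```

Keeping the request construction in its own function makes it possible to inspect the headers that will be sent without touching the network.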

2 Answers


Your Referer header was not being set correctly. I have hard-coded it and it works fine:

import requests

img_url = 'http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'

# Initial request kept from the question; its response is not used below.
r = requests.get(img_url, allow_redirects=False)

# Hard-code the Referer to the HTML page that hosts the image.
headers = {}
headers['Referer'] = 'http://7-themes.com/7041933-beautiful-backgrounds-wallpaper.html'

# Request the image with that Referer, without following redirects.
r = requests.get(img_url, headers=headers, allow_redirects=False)

# Write the response body to disk in binary mode.
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
    fh.write(r.content)
saurabh baid
  • Yup. I just found that my root cause is the `Referer` field, but it's not necessary to make two HTTP requests. – alec.tu Sep 27 '16 at 05:12
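
Rather than hard-coding the page URL, it could be derived from the image URL. This Python 3 sketch assumes the site always places the hosting page at the root, named after the image file with an `.html` extension. That pattern is inferred only from the single example in this thread and may not hold for other images; `guess_page_url` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def guess_page_url(img_url: str) -> str:
    """Guess the hosting page: same host, image file name with .html at the root."""
    parts = urlsplit(img_url)
    # Take the last path segment and drop its extension.
    stem, _ext = posixpath.splitext(posixpath.basename(parts.path))
    return urlunsplit((parts.scheme, parts.netloc, "/" + stem + ".html", "", ""))
```

The result could then be used as the value of `headers['Referer']`.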

I found that the root cause in my code was the Referer field in the header: it still pointed to an HTML page, not the image.

So I changed the Referer field to the img_url, and this works.

import requests

img_url = 'http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'

# Send the image URL itself as the Referer.
headers = {}
headers['Referer'] = img_url

# A single request is enough; redirects are followed by default.
r = requests.get(img_url, headers=headers)

# Write the response body to disk in binary mode.
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
    fh.write(r.content)
alec.tu