Python3 Download Incorrectly Encoded Image From URL

Question

I am trying to download an image that displays as an animated gif but appears to be encoded as a jpg. I say that it appears to be encoded as a jpg because the file extension is .jpg and the mime-type is image/jpeg.

When I download the file to my local machine (Mac OS X) and then try to open it, I get the error:

The file could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.

While I realize that some people would maybe just ignore that image, if it can be fixed, I'm looking for a solution to do that, not just ignore it.

The url in question is here:

http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg

Here is my code, and I am open to suggestions:

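import requests

# media is the url in question; uploadedFile is a placeholder path,
# since the original snippet did not show how either was set
media = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'
uploadedFile = 'downloaded_image.jpg'

response = requests.get(media, stream=True)
response.raise_for_status()

# The with block closes the file on exit, so no explicit close() is needed
with open(uploadedFile, 'wb') as img:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            img.write(chunk)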

Did you try to download it with right click -> save image as, and see if it opens? In my case (Debian 8), firefox opens it correctly. — raratiru, May 27 '17 at 22:59
@whackamadoodle3000 No difference. That was one of the first things I tried. I also tried changing the file extension to gif prior to saving the file to disk. — stwhite, May 27 '17 at 23:05
@raratiru yep I did and that downloads it as jpg and you can open it, but I am trying to do this with Python... — stwhite, May 27 '17 at 23:06
I am not familiar with multiframe images and JPG. However, the [Pillow (fork of PIL) docs](http://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#saving) say that by default Pillow only saves the first frame. This is why there is a `save_all` option. The solution may start from this point. — raratiru, May 28 '17 at 12:34
@raratiru though based on my code, I am downloading and saving the whole image to disk and not actually saving using PIL. The image is from a URL and not already on disk. — stwhite, May 29 '17 at 14:03
I am trying hard to find a solution ... [This](https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests?rq=1) is a very nice post, and [this](https://gist.github.com/hanleybrand/4221658) is a very nice script. No result, however ... I tried with [wget](https://pypi.python.org/pypi/wget) which did not succeed but the output is `-1 / unknown`. What might be that? I have tried with the pure image url which is `http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg` — raratiru, May 29 '17 at 17:48
@raratiru It's interesting because I can confirm that some sites such as Pinterest actually upload the image correctly.... not sure how they do it though. When I download the image the headers are text/html utf-8 which is strange to me. I also think the image is gzipped. — stwhite, May 29 '17 at 17:50
Indeed! You can paste the `response.content` [here](http://htmledit.squarefree.com/) and see that it is a web page which includes the image. I even tried to add some [headers](https://stackoverflow.com/a/27652558/2996101) to `requests.get()` but I receive the same result. This is probably a security measure against bots, isn't it? Maybe you can experiment more with the headers. — raratiru, May 29 '17 at 19:10
OK ... the url I pasted as "pure" is the same with yours. If you visit it, you will get a web page. Wow! The policy is against everybody! — raratiru, May 29 '17 at 19:24
@raratiru yep I'm seeing that as well! However, I'm still able to upload that image to Pinterest without a problem... I'm still trying to experiment with setting custom headers however, it's hard to account for this random image if the headers aren't even the same... because if the image is actually a gif, and the headers are html, then how can we determine the image type? Regardless, there has to be some way to figure out the encoding issue and determine the mime type... — stwhite, May 29 '17 at 20:32

raratiru · Answer 1 · 2017-05-29T23:54:55.713

According to Wheregoes, the link of the image:

http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg

receives a 302 redirect to the page that contains it:

http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/

Therefore, your code is trying to download a web page as an image.
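
One way to confirm this from Python is to check the status code and the Content-Type of the response instead of trusting the `.jpg` extension in the URL. The sketch below is only an illustration: `is_image_response` is a hypothetical helper (not part of requests), and when a plain dict is passed it assumes the key is spelled exactly `Content-Type`.

```python
def is_image_response(status_code, headers):
    """Return True only for a 200 response whose Content-Type is image/*.

    Hypothetical helper: `headers` is any mapping with a 'Content-Type' key,
    such as `response.headers` from requests.
    """
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=UTF-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()
    return status_code == 200 and media_type.startswith("image/")


# Headers similar to the ones reported in this thread
redirect_headers = {"Content-Type": "text/html; charset=UTF-8", "Content-Length": "0"}
print(is_image_response(302, redirect_headers))               # False
print(is_image_response(200, {"Content-Type": "image/gif"}))  # True
```

With requests, the same check could be written as `is_image_response(response.status_code, response.headers)`, and `response.history` lists any redirects that were followed on the way.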

I tried:

r = requests.get(the_url, headers=headers, allow_redirects=False)

But it returns zero content and status_code = 302.

(Indeed, it was obvious that this would happen ...)

This server is configured in a way that it will never fulfill that request.

Bypassing that limitation sounds ~~illegal~~ difficult, to the best of my -limited perhaps- knowledge.

edited May 29 '17 at 23:54

answered May 29 '17 at 20:45


I've attempted to use `allow_redirects=False` Unfortunately still no image headers: `{'Server': 'nginx', 'Date': 'Mon, 29 May 2017 22:15:29 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '0', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=60', 'X-Powered-By': 'PHP/5.6.30', 'Location': 'http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/'}` – stwhite May 29 '17 at 22:15
At this point I'm really not sure. I've even tried to block redirection, grab the cookies and then request again with the cookies but even that doesn't seem to work (I was under the assumption that cookies were needed to access the image—possibly to prevent web scrapers). – stwhite May 29 '17 at 22:39
@stwhite It is obvious that those people do not want direct access to the image. `allow_redirects=False` returns zero content and `status_code=302`. I am not sure that it is possible to bypass this situation without asking them for direct access to the settings of the server! – raratiru May 29 '17 at 22:47
I found a solution that seems to work! Thanks for all your help on this. – stwhite May 29 '17 at 23:31

stwhite · Accepted Answer · 2017-05-29T23:56:50.463

I had to answer my own question in this case: the fix was to send a Referer header with the request. Most likely a server-side rule (such as an htaccess file) prevents direct access to files on the image's server unless the request appears to come from one of the site's own pages.

from fake_useragent import UserAgent
import requests

# Set url
mediaURL = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'

# Create a user agent
ua = UserAgent()

# Create a request session
s = requests.Session()

# Set some headers for the request
# (the Referer header is only added after the first response, below)
s.headers.update({'User-Agent': ua.chrome})


# Make the request to get the image from the url
response = s.get(mediaURL, allow_redirects=False)


# The request was about to be redirected
if response.status_code == 302:

    # Get the next location that we would have been redirected to
    location = response.headers['Location']

    # Set the previous page url as referer
    s.headers.update({'referer': location})

    # Try the request again, this time with a referer
    response = s.get(mediaURL, allow_redirects=False, cookies=response.cookies)

    print(response.headers)

Hat tip to @raratiru for suggesting the use of allow_redirects.

Also noted in their answer is that the image's server might be intentionally blocking access to prevent general scrapers from viewing their images. Hard to tell, but regardless, this solution works.
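
Once the request returns image bytes, the real format can be read from the first bytes of the content rather than from the URL's extension. The following is a minimal sketch, assuming only GIF, JPEG and PNG matter; `sniff_image_type` and `save_with_detected_extension` are hypothetical helpers, not part of requests or of the code above.

```python
import os

# Leading "magic" bytes for a few common image formats
SIGNATURES = (
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
)


def sniff_image_type(data):
    """Return 'gif', 'jpg' or 'png' based on the leading bytes, else None."""
    for signature, extension in SIGNATURES:
        if data.startswith(signature):
            return extension
    return None


def save_with_detected_extension(data, stem, directory="."):
    """Write data to <directory>/<stem>.<ext>, where <ext> is the sniffed type."""
    extension = sniff_image_type(data)
    if extension is None:
        raise ValueError("content does not look like a GIF, JPEG or PNG")
    path = os.path.join(directory, f"{stem}.{extension}")
    with open(path, "wb") as handle:
        handle.write(data)
    return path


# Stand-in bytes: a GIF header, as if served from a URL ending in .jpg
fake_gif = b"GIF89a" + b"\x00" * 10
print(sniff_image_type(fake_gif))                  # gif
print(sniff_image_type(b"<!DOCTYPE html><html>"))  # None
```

With the code above, this could be called as `save_with_detected_extension(response.content, 'image')` once the second request returns a 200 status; an HTML page would raise `ValueError` instead of being saved as a broken image.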
