
I'm trying to scrape this image using `urllib.urlretrieve`.

>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
...                    path)  # path was previously defined

This code successfully saves the file in the given path. However, when I try to open the file, I get:

Could not load image 'imagename.jpg':
    Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)

When I run `file imagename.jpg` in my bash terminal, I get `imagename.jpg: HTML document, ASCII text`.
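
For what it's worth, the same diagnosis can be made from Python by reading the file's first two bytes: a real JPEG starts with `0xff 0xd8`, while `0x3c 0x21` is ASCII for `<!`, the start of an HTML document. A minimal check, assuming `path` is the same variable as above:

>>> with open(path, 'rb') as f:
...     print(repr(f.read(2)))  # '\xff\xd8' for a JPEG, '<!' for HTML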

So how do I scrape this image as a JPEG file?

NJay
  • No problems with `requests`, by the way: http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py. – alecxe Jul 13 '16 at 23:29
  • Thanks, I'll give it a shot. Any idea why this isn't working, though? Am I doing something wrong, or did I misunderstand how urlretrieve works? – NJay Jul 13 '16 at 23:32
  • Thanks, requests worked perfectly. :) – NJay Jul 13 '16 at 23:43

1 Answer


It's because the server hosting that image deliberately blocks requests whose HTTP `User-Agent` header identifies them as Python's urllib. Instead of the image, it sends back an HTML error page, which is exactly why your file starts with `0x3c 0x21` (ASCII for `<!`) and why `file` reports an HTML document. That's also why it works with `requests`, which sends a different `User-Agent`. You can still do it with the standard library, but you'll have to set a `User-Agent` header that makes the request look like something other than urllib. For example:

import urllib2

# Build the request with a custom User-Agent so the server doesn't
# recognize it as urllib and substitute an HTML error page.
req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()

# Write the raw bytes to disk ('wb' keeps binary data intact).
with open(path, 'wb') as outfile:
    outfile.write(imgdata)
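
If you're on Python 3, `urllib2` has been folded into `urllib.request`, and the same trick is a direct translation (same made-up User-Agent string as above):

import urllib.request

# Same idea in Python 3: pass a custom User-Agent via the headers dict.
req = urllib.request.Request(
    'http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
    headers={'User-Agent': 'Feneric Was Here'},
)
with urllib.request.urlopen(req) as resp, open(path, 'wb') as outfile:
    outfile.write(resp.read())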

So it's a little more involved to get around, but still not too bad.
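
For comparison, the `requests` route mentioned in the comments is about as short as it gets (a minimal sketch; `requests` is a third-party package, and `path` is the same destination as above):

import requests

resp = requests.get('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
resp.raise_for_status()  # raise instead of silently saving an error page
with open(path, 'wb') as outfile:
    outfile.write(resp.content)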

Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.

Feneric
  • Abusive? How so? Too many hits on the server because of excessive scraping? – NJay Jul 14 '16 at 13:04
  • And if the owner blocked access from urllib, why hasn't he done the same with requests? – NJay Jul 14 '16 at 13:05
  • @NJay while I can't speak for the motivations of that particular server admin, I have some guesses based on what I've seen on other servers. 1) Excessive scraping could be the problem for popular sites as usually they pay for bandwidth (and some people have been known to grab whole sites over short periods of time); 2) Some sites set blanket blocks trying to stop harvesting spiders; 3) Some admins just copy in 3rd-party tools that "prevent abuse" without questioning what they do or understanding the bigger situation. – Feneric Jul 14 '16 at 13:22
  • As for why requests isn't being blocked, it'll likely vary with the answer above. For 1 & 2, it's likely that requests just isn't used as much as the built-in urllib and hasn't registered as a problem for them yet. For 3, it's probably that requests didn't even exist when the tool was written. – Feneric Jul 14 '16 at 13:24
  • Interesting. Thanks! :) – NJay Jul 16 '16 at 18:20