
I'm trying to download Magic: The Gathering card images from scryfall.com. They provide a JSON file with all the information about every single card (including the URL for its image), so I wrote code that reads every URL from that JSON file and attempts to save the image. The problem is that the request part of the code takes more than 5 minutes per image to run, and I have no idea why. (Each image I'm fetching is less than 100 kB and opens instantly in the browser.)

I have tried urllib.urlretrieve and urllib2.urlopen, and it's all the same. I tried running it on both Python 2 and Python 3.

There are no error messages; the code actually works. Only the long time it takes makes it unviable to carry on with it.

edit:

import json
import urllib.request

# load the whole card database (a single JSON array of card objects)
with open("cards.json") as f:
    content = json.load(f)

count = 0
for j in content:
    count = count + 1
    if j['layout'] == 'normal' and j['digital'] == False:
        url = j['image_uris']['normal']
        final = url[url.find('normal') + 6:]  # meant to become the save path
        print(url)
        print("a")
        i1 = urllib.request.urlopen(url)  # this is the step that takes minutes
        print("b")
        i2 = i1.read()
        out = open(str(count), 'wb')  # save under a numeric filename
        out.write(i2)
        out.close()

    if count > 5:
        exit()
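
(A diagnostic sketch, not part of the original post: timing a single fetch with a timeout shows whether the request is hanging or merely transferring slowly; timeout is a standard urllib.request.urlopen parameter, and the URL is one quoted later in the comments.)

import time
import urllib.request

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'

start = time.time()
data = urllib.request.urlopen(url, timeout=10).read()  # give up after 10 s instead of hanging
print(len(data), 'bytes in', time.time() - start, 'seconds')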

edit2: the link to the JSON I'm using: https://archive.scryfall.com/json/scryfall-default-cards.json

David Spira
    Not much anyone can suggest without at least example code, timings, bandwidth figures, size of images. – Paula Thomas Jul 05 '19 at 03:03
  • Well, I don't know what else I can tell you about the code. I tried those two commands (also another one with the requests lib) inside loops, and they were the slow step of the execution. I also mentioned the sizes of the images: the files are no greater than 100 kB each. It takes more than 5 minutes to show an image (with PIL) after the request, but it's the request step that is very slow, not the im.show() command. @PaulaThomas – David Spira Jul 05 '19 at 03:10
  • What you can do is put the code up; see almost any other question on here. – Paula Thomas Jul 05 '19 at 03:18
  • Do you have the same problem when you download it with a web browser or other tools like `wget` or `curl`? – furas Jul 05 '19 at 03:32
  • You aren't using 'final' anywhere. It also looks like you are grabbing the page and then grabbing the image. There are lots of different answers to this type of issue already here: https://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python –  Jul 05 '19 at 03:33
  • Except I can't! Could you at least put up the structure of 'cards.json'? – Paula Thomas Jul 05 '19 at 03:33
  • To get the data from the file you only need `content = json.loads(open("cards.json").read())`. You don't need the list, the `append()`, or the later `for` loop. – furas Jul 05 '19 at 03:35
  • @PaulaThomas Try with this url: https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651 – David Spira Jul 05 '19 at 03:35
  • You should add a link to `cards.json` so everyone can download it and test the code. – furas Jul 05 '19 at 03:36
  • @furas Pasting that URL into the browser opens the image very quickly. – David Spira Jul 05 '19 at 03:37
  • @MikeSperry The final variable is just an attempt to specify the path to save the file, which I want to be similar to the path of the actual image. – David Spira Jul 05 '19 at 03:37
  • @furas added it – David Spira Jul 05 '19 at 03:38
  • When I try `requests.get( img.scryfall.com/cards/normal/front/2/c/…)` I get it in less than 1 second. – furas Jul 05 '19 at 03:38
  • @furas Could you provide the code you used? – David Spira Jul 05 '19 at 03:39
  • @furas I tried adding `im=requests.get(url)` followed by `print 'a'` inside the `if j['layout']=='normal' and j['digital']==False:` block to check whether the "a" would be printed quickly, but it isn't. The code gets stuck on the request part all the same. – David Spira Jul 05 '19 at 03:44
  • It still strikes me that the important file here is cards.json; please post the structure and size of this file. – Paula Thomas Jul 05 '19 at 03:46
  • @PaulaThomas This JSON file is 140 MB! – furas Jul 05 '19 at 03:48
  • WAIT, the file you are extracting the URLs from is 140 MB! How many URLs are in there?!? – Paula Thomas Jul 05 '19 at 03:55
  • @PaulaThomas I checked it: there are 47237 URLs in total, and 41805 of them meet the requirements `j['layout']=='normal' and j['digital']==False`, so they would be downloaded. – furas Jul 05 '19 at 04:01
  • OK, now multiply 41805 by 0.5 (assuming the site doesn't have downloading restrictions) and I think my work is done here! – Paula Thomas Jul 05 '19 at 04:41
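
(That estimate works out to 41805 × 0.5 s ≈ 20,900 s, roughly 5.8 hours for a full download.)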

2 Answers


This code gets the image in less than 1 second:

import requests

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
r = requests.get(url)

with open('image.jpg', 'wb') as f:
    f.write(r.content)
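
As a side note (not in the original answer), checking the HTTP status and elapsed time of a single request would distinguish a slow transfer from a server-side rejection; status_code, elapsed, and raise_for_status() are standard requests attributes:

import requests

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
r = requests.get(url, timeout=10)  # fail fast instead of hanging indefinitely
r.raise_for_status()               # raises an HTTPError if the server rejected the request
print(r.status_code, r.elapsed)    # HTTP status and time the request took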

The same happens with this code:

import urllib.request

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
urllib.request.urlretrieve(url, 'image.jpg')

I didn't check with more images. Maybe the problem is that the server sees too many requests from one IP in a short time and then blocks them.
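
If that is the cause, spacing the requests out may avoid the block. A minimal sketch of that idea (the 0.1 s pause is an assumption, not a documented Scryfall limit):

import time
import urllib.request

urls = [
    'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651',
    # ... the other image URLs read from cards.json
]

for i, url in enumerate(urls):
    urllib.request.urlretrieve(url, '%08d.jpg' % i)
    time.sleep(0.1)  # assumed polite delay between requests; increase if still blocked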


EDIT: I used this code to download 10 images and display the time

import urllib.request
import time
import json

print('load json')

start = time.time()
content = json.loads(open("scryfall-default-cards.json").read())
end = time.time()
print('time:', end-start)

# ---

start = time.time()

all_urls = len(content)

urls_to_download = 0
for item in content:
    if item['layout'] == 'normal' and item['digital'] is False:
        urls_to_download += 1

print('urls:', all_urls, urls_to_download)

end = time.time()
print('time:', end-start)

# ----

start = time.time()
count = 0
for item in content:
    if item['layout'] == 'normal' and item['digital'] is False:
        count += 1
        url = item['image_uris']['normal']
        name = url.split('?')[0].split('/')[-1]
        print(name)
        urllib.request.urlretrieve(url, 'imgs/' + name)  # requires an existing imgs/ directory
    if count >= 10:
        break
end = time.time()
print('time:', end-start)

Results

load json
time: 3.9926743507385254
urls: 47237 41805
time: 0.054879188537597656
2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg
37bc0128-a8d0-477c-abcf-2bdc9e38b872.jpg
2ae1bb79-a931-4d2e-9cc9-a06862dc5cde.jpg
4889a668-0f01-4447-ad2e-91b329258f22.jpg
5b13ba5a-f4b0-420a-9e4f-a65e57721fa4.jpg
893b309d-5e8f-47fa-9f54-eaf16a5f96e3.jpg
27d30285-7729-4130-a768-71867aefe9b3.jpg
783616d6-e3ea-43fd-97eb-6e4c5a2c711f.jpg
cc101b90-3e17-4beb-a606-3e76088e362c.jpg
36da00e3-3ef6-4ad5-a53d-e71cfdafc1e6.jpg
42e1033b-383e-49b4-875f-ccdc94e08c9d.jpg
time: 2.656561851501465
furas

Here is a perfectly simple and valid way to grab these images very quickly. I didn't time it, but it was also less than a second.

from urllib import request 

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'

with open('00000001.jpg', 'wb') as f:
    f.write(request.urlopen(url).read())
  • I copy-pasted it and am running it right now, and it is actually taking a while, over 2 minutes now. Has anybody any idea what the hell is wrong here? – David Spira Jul 05 '19 at 03:57
  • The answer from furas that you may have exceeded the request limit for a single IP is a good candidate for the right answer, especially if you were trying to grab every image listed in that JSON file in one go. Those types of limits often reset in 24 hours (set by an admin on their site, so I don't know for sure), but you'd quickly blow past it again. –  Jul 05 '19 at 03:58
  • I just tried your code with an image URL from another website, and it worked pretty fast. Thank you both. Edit: by the way, is there a way I can hide my IP so I can bypass this blockade? – David Spira Jul 05 '19 at 04:00
  • @DavidSpira You would have to use proxy servers to have different IPs, but I don't know any good free ones. There are portals that have lists of free proxy servers, but I never checked them. There are also portals that sell cheap packages of "tested" proxies, but I never checked those either. – furas Jul 05 '19 at 04:05
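
For what it's worth, requests can route a download through a proxy via its proxies parameter; a minimal sketch (the proxy address is a placeholder, not a real server):

import requests

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'

proxies = {'https': 'http://10.10.1.10:3128'}  # hypothetical proxy; substitute a real one

r = requests.get(url, proxies=proxies, timeout=10)
with open('image.jpg', 'wb') as f:
    f.write(r.content)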