
So currently I have this code, and it works exactly as I intended.

import urllib.request
from tqdm import tqdm

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

x = 0

for length in tqdm(itemIDS):
    urllib.request.urlretrieve(
        "https://imagemocksite.com?id="+str(itemIDS[x]), 
        "images/"+str(itemIDS[x])+".jpg")
    x += 1

print("All images downloaded")

I was searching around and the solutions I found weren't really what I was looking for. I have a 200 Mbps connection, so bandwidth isn't my issue.

My issue is that my loop only runs at 1.1-1.57 iterations per second. I want to make this faster, as I have over 5k images to download, and they are only roughly 1-5 KB each.

Also if anyone has any code tips in general, I'd appreciate it! I'm learning python and it's pretty fun so I would like to get better wherever possible!

Edit: Using the info below about asyncio I am now getting 1.7-2.1 it/s, which is better! Could it be faster? Maybe I used it wrong?

import urllib.request
from tqdm import tqdm
import asyncio

with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]

async def download():
    x = 0
    for length in tqdm(itemIDS):
        await asyncio.sleep(1)
        urllib.request.urlretrieve(
            "https://imagemocksite.com?id="+str(itemIDS[x]), 
            "images/"+str(itemIDS[x])+".jpg")
        x += 1

asyncio.run(download())
print("All images downloaded")
  • What if the server is the bottleneck and is rate limiting you? Then you can't make the loop much more efficient. You might be able to download images faster through multiprocessing. – Yoshikage Kira May 18 '21 at 22:29
  • You wouldn't use `multiprocessing` for this but `threading`, @Goion, since this is an I/O-bound task, not a CPU-bound one. – gold_cy May 18 '21 at 22:32
  • As @Goion states, it may indeed be limited by the server. To add to his latter comment: typically the [asyncio](https://docs.python.org/3/library/asyncio.html) library (or one with similar functionality, like Tornado) is used for these kinds of I/O applications, as multiprocessing is more geared towards heavy calculations. In contrast, `asyncio` runs on a single process and single thread, but can 'pause' a function (await it) and go on with the rest of the program whilst waiting for input, based on Python coroutines. – jrbergen May 18 '21 at 22:33
  • mp, threading, and async would all be good options. I don't see any reason to pick one over the other besides personal preference. – tdelaney May 18 '21 at 22:35
  • I suggested mp because of the GIL limitation. But now that I think about it, multithreading and even async will work, since most of the time you will be waiting anyway. Mp might just be a waste of resources. – Yoshikage Kira May 18 '21 at 22:36
  • So on the high chance that the server I'm connecting to is the bottleneck, I just have to accept my defeat? Ah, that's unfortunate but understandable – unsettled_duck May 18 '21 at 22:37
  • Yes. You have to keep in mind that the server is also handling other users' requests. If a malicious user sent a billion requests, they could essentially crash the server. – Yoshikage Kira May 18 '21 at 22:41
  • This can help you: https://stackoverflow.com/questions/22190403/how-could-i-use-requests-in-asyncio – Yoshikage Kira May 18 '21 at 23:02
  • There is a nice example in the [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example) docs showing *parallel/threaded* web requests (see the sketch after these comments); you could easily adapt it and do some testing on a portion of your images to see if the server is throttling. – wwii May 18 '21 at 23:19
  • Thank you all for the help! I decided to settle on the asyncio method for now, but once I download all of these images I will practice concurrent futures! – unsettled_duck May 18 '21 at 23:47
  • You need to use what works for you, but the `asyncio` example you're showing above is not any different from your original code. All of the fetches are being done sequentially. – Tim Roberts May 18 '21 at 23:57
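As a rough sketch of the threaded approach suggested in the comments (adapted from the ThreadPoolExecutor example in the concurrent.futures docs), something like the following could replace the sequential loop. The URL and file layout are taken from the question; `max_workers=16` is just an assumed starting point, not a recommendation from the discussion.

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

from tqdm import tqdm

with open("output.txt", "r") as file:
    item_ids = [line.strip() for line in file]

def fetch(item_id):
    # urlretrieve blocks on network I/O, so multiple threads can overlap downloads
    urllib.request.urlretrieve(
        f"https://imagemocksite.com?id={item_id}",
        f"images/{item_id}.jpg",
    )
    return item_id

with ThreadPoolExecutor(max_workers=16) as executor:
    futures = [executor.submit(fetch, item_id) for item_id in item_ids]
    for future in tqdm(as_completed(futures), total=len(futures)):
        future.result()  # re-raise any download error

print("All images downloaded")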

1 Answer


The comments have already provided good advice, and I think you're right to use asyncio, which is really the typical Python tool for this kind of job.

I just wanted to bring some help, since the code you've provided doesn't really use its power.

First, you'll have to install aiohttp and aiofiles, which handle HTTP requests and local filesystem I/O asynchronously.

Then, define a download(item_id, session) helper coroutine that downloads one single image based on its item_id. session will be an aiohttp.ClientSession, which is the class used to run async HTTP requests in aiohttp.

Finally, the trick is to have a download_all coroutine that calls asyncio.gather on all the individual download() coroutines at once. asyncio.gather is the way to tell asyncio to run several coroutines "in parallel".

This should massively speed up your downloads. If not, then it's the third-party server that is limiting you.

import asyncio

import aiohttp
import aiofiles


with open("output.txt", "r") as file:
    itemIDS = [line.strip() for line in file]


async def download(item_id, session):
    url = "https://imagemocksite.com"
    filename = f"images/{item_id}.jpg"
    async with session.get(url, params={"id": item_id}) as response:
        async with aiofiles.open(filename, "wb") as f:
            await f.write(await response.read())


async def download_all():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[download(item_id, session) for item_id in itemIDS]
        )


asyncio.run(download_all())
print("All images downloaded")
Roméo Després