
I'm trying to download many images from a list of URLs. By many, I mean in the vicinity of 10,000. The images vary in size, from a few hundred KB to 15 MB. I wonder what would be the best strategy for this task, trying to minimize the total time to finish and to avoid freezing.

I use this function to save each image:

def save_image(name, base_dir, data):
    with open(base_dir + name, "wb+") as destination:
        for chunk in data:
            destination.write(chunk)

I take the file extension from the URL with this function:

from os.path import splitext
from urllib.parse import urlparse

def get_ext(url):
    """Return the filename extension from url, or ''.

    From: https://stackoverflow.com/questions/28288987/identify-the-file-extension-of-a-url
    """
    parsed = urlparse(url)
    root, ext = splitext(parsed.path)
    return ext  # or ext[1:] if you don't want the leading '.'

And to get the images I just do:

import requests

for image in listofimages:
    r = requests.get(image["url"], timeout=5)
    extension = get_ext(image["url"])
    name = str(int(image["ID"])) + "AA" + extension
    save_image(name, "images/", r)

Now putting it all together is quite slow. Hence my question.

GuitarExtended
  • What do you mean by slow? Do you have evidence it should be faster? There's a difference between something being slow and something taking a long time. What do you mean by freezing? It's quite possible there are limitations on network bandwidth, file system access, working memory, CPUs. Managing all of these will give you some optimal performance for your situation. You can probably parallelise your code with [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) relatively easily, which will use multiple CPUs and network connections. – Peter Wood Jun 22 '21 at 10:31
  • https://stackoverflow.com/questions/57205531/python-how-to-download-multiple-files-in-parallel-using-multiprocessing-pool – Peter Wood Jun 22 '21 at 10:32
  • @PeterWood Given how much I/O and network waiting is involved (releasing of the Global Interpreter Lock), I would think that multithreading would be the better model, and you could certainly do *way* better by using a larger pool size than the number of CPU cores you have, as is done by the solution you allude to. – Booboo Jun 22 '21 at 10:40
  • @PeterWood by freezing, I mean that my code stops (i.e. it doesn't continue) after having downloaded about 700 images. I get no error in the console, it's just not "moving on" anymore. Sorry for not being more precise. – GuitarExtended Jun 22 '21 at 10:45

1 Answer


One, as hinted in the comments above, you probably want to parallelize the work. Multiprocessing and multithreading will work, but with relatively high overhead. Alternatively, you could use an asynchronous approach, such as monkey-patching your network libraries with gevent, or using asyncio together with an async-aware HTTP client such as httpx.
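For illustration, here is a minimal sketch of the asyncio + httpx route, reusing `get_ext`, `listofimages` and the "images/" directory from the question; the concurrency cap of 20 is an arbitrary assumption you would tune, not a recommendation:

import asyncio
import httpx

async def fetch_one(client, semaphore, image):
    async with semaphore:  # cap the number of requests in flight
        r = await client.get(image["url"], timeout=5)
        r.raise_for_status()
        name = str(int(image["ID"])) + "AA" + get_ext(image["url"])
        with open("images/" + name, "wb") as destination:
            destination.write(r.content)

async def fetch_all(listofimages):
    semaphore = asyncio.Semaphore(20)
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(fetch_one(client, semaphore, image)
                               for image in listofimages))

asyncio.run(fetch_all(listofimages))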

Regardless of the approach you take to parallelize I/O, you might find the queue paradigm convenient to work with -- put all your URLs into a queue, and let your workers consume them.
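As a sketch of that pattern with plain threads and a queue.Queue, reusing `save_image`, `get_ext` and `listofimages` from the question (the worker count of 16 is an assumption to tune for your bandwidth and disk):

import queue
import threading
import requests

def worker(q):
    while True:
        image = q.get()
        if image is None:  # sentinel: no more work for this worker
            q.task_done()
            break
        try:
            r = requests.get(image["url"], timeout=5)
            name = str(int(image["ID"])) + "AA" + get_ext(image["url"])
            save_image(name, "images/", r)
        except requests.RequestException:
            pass  # log and skip rather than stall the whole run
        finally:
            q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(16)]
for t in threads:
    t.start()
for image in listofimages:
    q.put(image)
for _ in threads:
    q.put(None)  # one sentinel per worker
for t in threads:
    t.join()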

Two, to deal with non-responsive web servers blocking your workers, you'll probably need to set socket timeouts; check your HTTP client library's documentation for how to do this. For instance, the popular requests library simply takes a timeout parameter.
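For example, requests also accepts a (connect, read) tuple if you want separate limits; the 5 and 30 second values here are just placeholders:

r = requests.get(image["url"], timeout=(5, 30))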

Vytas