I need to download about 10 million images. After a small experiment downloading the first 1,000, I noticed each one takes ~4.5 seconds (which could probably be sped up somewhat with multiprocessing.Pool), but the bigger problem is that the average image is ~2400x2400 at ~2.2 MB. I can resize them as soon as they are downloaded, but the main bottleneck (currently) is internet bandwidth. Is there a way to download the images directly at a lower resolution?
Sample dummy code:
import requests

resp = requests.get("some_url.jpg")
fn = "local_filename.jpg"  # placeholder destination path
with open(fn, 'wb') as f:
    f.write(resp.content)
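And this is roughly what I had in mind for parallelizing and resizing right after download (a rough sketch assuming Pillow; I used a thread pool rather than multiprocessing.Pool since the work is mostly I/O-bound, and the URL list, filenames, and 512x512 target are just placeholders):

import concurrent.futures
import io

import requests
from PIL import Image

def download_and_resize(url, fn, size=(512, 512)):
    # Download the full-size image, then shrink it immediately
    # so only the small version is kept on disk.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    img = img.convert("RGB")  # make sure the mode is JPEG-compatible
    img.thumbnail(size)       # resize in place, preserving aspect ratio
    img.save(fn, "JPEG")

urls = ["some_url_1.jpg", "some_url_2.jpg"]  # placeholder list
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    futures = [
        pool.submit(download_and_resize, url, f"img_{i}.jpg")
        for i, url in enumerate(urls)
    ]
    concurrent.futures.wait(futures)

This still transfers the full ~2.2 MB per image before resizing, though, so it doesn't address the bandwidth problem, hence the question.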