
I have a use case where a large remote file needs to be downloaded in parts, using multiple threads. Each thread must run simultaneously (in parallel), grabbing a specific part of the file. The expectation is to combine the parts into a single (original) file once all parts have been successfully downloaded.

Perhaps the requests library could do the job, but I am not sure how I would multithread this into a solution that combines the chunks.

import requests

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = requests.get(url, headers=headers)

I was also thinking of using curl, with Python orchestrating the downloads, but I am not sure that's the right way to go. It just seems too complex and strays away from a vanilla Python solution. Something like this:

curl --range 200000000-399999999 -o file.iso.part2
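
(In Python, the orchestration I have in mind would be roughly the sketch below: one curl process per byte range, run from a thread pool, with the parts concatenated afterwards. The URL, byte ranges and file names are just placeholders.)

import concurrent.futures
import subprocess

url = 'https://url.com/file.iso'
# placeholder byte ranges; in reality these would be computed from the file size
ranges = ['0-199999999', '200000000-399999999']

def fetch(i, byte_range):
    # each worker shells out to curl for its own byte range
    part = f'file.iso.part{i}'
    subprocess.run(['curl', '--range', byte_range, '-o', part, url], check=True)
    return part

with concurrent.futures.ThreadPoolExecutor() as pool:
    parts = list(pool.map(fetch, range(len(ranges)), ranges))

with open('file.iso', 'wb') as out:
    for part in parts:
        with open(part, 'rb') as p:
            out.write(p.read())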

Can someone explain how you'd go about something like this? Or post a code example of something that works in Python 3? I usually find the Python-related answers quite easily, but the solution to this problem seems to be eluding me.

asked by jjj, edited by John Kugelman
  • What about [this answer](https://stackoverflow.com/questions/13973188/requests-with-multiple-connections)? – bug Oct 26 '19 at 13:58
  • That seems to be Python 2 related and wouldn't work in Python 3 – jjj Oct 26 '19 at 14:01
  • This is pointless. The network is not multi-threaded. Use a single thread. – user207421 Aug 12 '22 at 09:56
  • @user207421 users and devs of Aria2 would tend to disagree. Also, plenty of download tools support resuming downloads. How do you think that works, exactly? And networks are absolutely multi-threaded. How do you think apis such as fastapi send out concurrent responses to multiple concurrent clients? Does your web server only serve 1 client at a time (first come, first serve)? For example: Nginx has worker_connections (and rate-limiting) for a reason. FYI: "Threads" in networks are called sockets. And no, IDC if your rep is over 300k. – DataMinion May 05 '23 at 13:15

4 Answers


Here is a version using Python 3 with asyncio. It's just an example and can be improved, but it should give you everything you need.

  • get_size: sends a HEAD request to get the size of the file
  • download_range: downloads a single chunk
  • download: downloads all the chunks and merges them

import asyncio
import concurrent.futures
import functools
import requests
import os


# WARNING:
# Here I'm pointing to a publicly available sample video.
# If you are planning on running this code, make sure the
# video is still available as it might change location or get deleted.
# If necessary, replace it with a URL you know is working.
URL = 'https://download.samplelib.com/mp4/sample-30s.mp4'
OUTPUT = 'video.mp4'


async def get_size(url):
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size


def download_range(url, start, end, output):
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)

    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)


async def download(run, loop, url, output, chunk_size=1000000):
    file_size = await get_size(url)
    chunks = range(0, file_size, chunk_size)

    tasks = [
        run(
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]

    await asyncio.wait(tasks)

    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'

            with open(chunk_path, 'rb') as s:
                o.write(s.read())

            os.remove(chunk_path)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.new_event_loop()
    run = functools.partial(loop.run_in_executor, executor)

    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(
            download(run, loop, URL, OUTPUT)
        )
    finally:
        loop.close()
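
As a side note, on Python 3.7+ the manual event-loop handling in the __main__ block can be replaced with asyncio.run. A minimal sketch of an alternative entry point, reusing the download coroutine above:

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        run = functools.partial(loop.run_in_executor, executor)
        await download(run, loop, URL, OUTPUT)


if __name__ == '__main__':
    asyncio.run(main())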
bug
  • This gave me an error `ValueError: Set of coroutines/Futures is empty` – West Aug 12 '22 at 04:18
  • This happens when `chunks` is empty, it may be due to `file_size` (coming from the header `Content-Length`) being 0 – bug Aug 12 '22 at 09:41
  • That could be because the url `https://file-examples.com/storage/fe2ef7477862f581f9ce264/2017/04/file_example_MP4_1920_18MG.mp4` is broken. Just pick a different url that you know is working. – Inspired_Blue Sep 30 '22 at 13:22
  • Thanks @Inspired_Blue, that is definitely something to keep in mind. I've replaced the broken URL, but these public sample videos keep changing, so it might break again. – bug Sep 30 '22 at 14:03
  • @bug: I have modified your answer to use `ThreadPoolExecutor` instead of `asyncio`, just to illustrate that it is _also_ possible. There is no compelling reason to use `ThreadPoolExecutor`; it's just that I feel `ThreadPoolExecutor` is more beginner friendly and has a simpler interface. But I also feel that `asyncio` is more powerful. Here is [my answer](https://stackoverflow.com/a/73909335/4949315). And I also show how to display a progress bar using `tqdm`. – Inspired_Blue Sep 30 '22 at 14:38

The best way I found is to use a module called pySmartDL.
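
Basic usage looks roughly like this (a minimal sketch based on the pySmartDL documentation; the URL is just a placeholder):

from pySmartDL import SmartDL

url = 'http://example.com/file.iso'  # placeholder URL
dest = 'file.iso'

obj = SmartDL(url, dest)  # pySmartDL opens several connections to the server itself
obj.start()               # blocks until the download is finished; shows a progress bar
print(obj.get_dest())     # path of the downloaded file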

Edit: This module has some issues, such as no way to pause the download and resume it later, and the project isn't actively maintained anymore.

So if you are looking for such features, I would suggest you try pypdl instead. Be aware that it doesn't have some of the advanced features that pySmartDL offers, though for most folks pypdl will be the better choice.

  • pypdl can pause/resume downloads

  • pypdl can retry the download in case of failure, and has an option to continue downloading from a different URL if necessary

and many more ...

How to install pypdl

Step 1: pip install pypdl

Step 2: to download the file, you could use

from pypdl import Downloader

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt')

Note: This gives you a download meter by default.

In case you need to hook the download progress to a GUI, you could use

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt', block=False, display=False)
while dl.progress != 100:
    print(dl.progress)

If you want to use more threads, you can use

dl = Downloader()
dl.start('http://example.com/file.txt', 'file.txt', num_connections=8)

You can find many more features on the project page: https://pypi.org/project/pypdl/

Jishnu

You could use grequests to download in parallel.

import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB
HEADERS = []

_start, _stop = 0, 0
for x in range(4):  # file size is > 300 MB, so we download in 4 parts.
    _start = _stop
    _stop = CHUNK_SIZE * (x + 1)
    # HTTP Range is inclusive on both ends, so end each part one byte
    # before the next one starts to avoid duplicating the boundary bytes.
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop - 1)})


rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)

# 'wb' so a re-run overwrites the file instead of appending to an old copy
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'wb') as f:
    for download in downloads:
        print(download.status_code)
        f.write(download.content)

PS: I did not check if the Ranges are correctly determined or if the downloaded md5sum matches! This should just show in general how it could work.
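
If you want to verify the reassembled file, a quick check could compare its size against the server's Content-Length and compute an md5 sum to compare against the published checksum. A sketch, reusing the URL constant from the snippet above:

import hashlib
import os
import requests

path = '/tmp/debian-10.1.0-amd64-netinst.iso'

# compare the local size against the size reported by the server
expected_size = int(requests.head(URL).headers.get('Content-Length', 0))
print(os.path.getsize(path), expected_size)

# compute the md5 sum; compare it by hand against the checksum published
# alongside the ISO on the Debian mirror
md5 = hashlib.md5()
with open(path, 'rb') as f:
    for block in iter(lambda: f.read(1 << 20), b''):
        md5.update(block)
print(md5.hexdigest())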

Maurice Meyer
  • This is exactly what I needed. BTW. This is great, but if you have a second to amend the code to show the progress of each of the downloading parts, that'd be awesome. – jjj Oct 27 '19 at 09:51
  • You could try this: https://stackoverflow.com/questions/33703730/adding-progress-feedback-in-grequests-task – Maurice Meyer Oct 27 '19 at 10:06
  • An issue I found with this script is that the combined download file doesn't match the byte size of the original. For the file you've shown (iso), the total size = 351272960 bytes, but the downloaded file is 3 bytes longer: 351272963 bytes. – jjj Oct 27 '19 at 13:58

You can also use ThreadPoolExecutor (or ProcessPoolExecutor) from concurrent.futures instead of asyncio. The following shows how to modify bug's answer to use ThreadPoolExecutor:

Bonus: The following snippet also uses tqdm to show a progress bar for the download. If you don't want to use tqdm, just comment out the with tqdm(total=file_size, ...) block below. tqdm can be installed with pip install tqdm, and more information is available in its documentation. By the way, tqdm can also be used with asyncio.

import requests
import concurrent.futures
from concurrent.futures import as_completed
from tqdm import tqdm
import os

def download_part(url_and_headers_and_partfile):
    url, headers, partfile = url_and_headers_and_partfile
    response = requests.get(url, headers=headers)
    # setting same as below in the main block, but not necessary:
    chunk_size = 1024*1024 

    # Need size to make tqdm work.
    size=0 
    with open(partfile, 'wb') as f:
        for chunk in response.iter_content(chunk_size):
            if chunk:
                size+=f.write(chunk)
    return size

def make_headers(start, chunk_size):
    end = start + chunk_size - 1
    return {'Range': f'bytes={start}-{end}'}

url = 'https://download.samplelib.com/mp4/sample-30s.mp4'
file_name = 'video.mp4'
response = requests.get(url, stream=True)
file_size = int(response.headers.get('content-length', 0))
chunk_size = 1024*1024

chunks = range(0, file_size, chunk_size)
my_iter = [[url, make_headers(chunk, chunk_size), f'{file_name}.part{i}'] for i, chunk in enumerate(chunks)] 

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    jobs = [executor.submit(download_part, i) for i in my_iter]

    # unit_divisor=1024 makes the binary (iB) units scale correctly
    with tqdm(total=file_size, unit='iB', unit_scale=True, unit_divisor=1024, leave=True, colour='cyan') as bar:
        for job in as_completed(jobs):
            size = job.result()
            bar.update(size)

with open(file_name, 'wb') as outfile:
    for i in range(len(chunks)):
        chunk_path = f'{file_name}.part{i}'
        with open(chunk_path, 'rb') as s:
            outfile.write(s.read())
        os.remove(chunk_path)
Inspired_Blue