I am currently developing a Python script that calls a REST API to download data that is made available every day through the API. The files that I am trying to download have a `.txt.bz2` extension.

The API documentation recommends using curl to download data from the API. In particular, the recommended command to download the data is:

curl --user Username:Password https://api.endpoint.com/data/path/to/file -o my_filename.txt.bz2

Where, of course, the URL of the API data endpoint here is fictitious.

Since the documentation recommends curl, my current implementation of the Python script leverages the subprocess library to call curl within Python:

import subprocess

def data_downloader(download_url, file_name, api_username, api_password):
    # Build the curl command line and let curl write the file to disk.
    args = ['curl', '--user', f'{api_username}:{api_password}', download_url, '-o', file_name]
    subprocess.call(args)
    return file_name
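
For completeness, a slightly more defensive sketch of the same call (not part of my original script) would use subprocess.run with check=True, so that a non-zero curl exit status raises an exception, and curl's --fail flag, so that HTTP error responses are not silently saved to disk:

import subprocess

def data_downloader_checked(download_url, file_name, api_username, api_password):
    # --fail makes curl exit with a non-zero status on HTTP errors instead of
    # saving the error body; check=True turns that into a CalledProcessError.
    args = ['curl', '--fail', '--user', f'{api_username}:{api_password}', download_url, '-o', file_name]
    subprocess.run(args, check=True)
    return file_name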

However, since I am extensively using the requests library in other parts of the application that I am developing, mainly to send requests to the API and to walk its file-system-like structure, I have tried to implement the download function with this library as well. In particular, I have used this other Stack Overflow thread as the reference for my alternative implementation, and the two functions that I have implemented with the requests library look like this:

import requests
import shutil

def download_file(download_url, file_name, api_username, api_password, chunk_size):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            # Stream the response body to disk in chunks of chunk_size bytes.
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name

def shutil_download(download_url, file_name, api_username, api_password):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            # Copy the raw response stream directly into the open file object.
            shutil.copyfileobj(r.raw, f)
    return file_name
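
For reference, a more defensive sketch of the chunked download (not something my original code did) would call raise_for_status() on the response, so that a 4xx/5xx answer raises an exception instead of silently writing a small error body to disk:

import requests

def download_file_checked(download_url, file_name, api_username, api_password, chunk_size):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        # Raise an exception for 4xx/5xx responses instead of writing the error body to disk.
        r.raise_for_status()
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name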

However, while with the subprocess implementation I am able to download the entire file without any issue, with either of the two requests implementations I always end up with a downloaded file of about 1 KB, which is clearly wrong since most of the files I am downloading are larger than 10 GB.

I suspect that the issue is caused by the format of the data that I am attempting to download, as I have seen successful attempts at downloading .zip or .gzip files using the same logic as in the two functions. Hence I am wondering whether anyone has an explanation for the issue I am experiencing, or can provide a working solution to the problem.
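
For debugging, a small sketch along these lines (the helper name is just illustrative) could be used to inspect what the ~1 KB file actually contains, i.e. the status code, the relevant headers and the first bytes of the payload (a bzip2 stream should start with the bytes BZh):

import requests

def inspect_response(download_url, api_username, api_password):
    # Hypothetical diagnostic helper: print the status code, headers and the first
    # bytes of the payload to check whether the server returned an error page
    # instead of the expected .txt.bz2 data (a bzip2 stream starts with b'BZh').
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        print(r.status_code)
        print(r.headers.get('Content-Type'))
        print(r.headers.get('Content-Length'))
        print(next(r.iter_content(chunk_size=16), b''))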

UPDATE

I had a chance to discuss the issue with the owner of the API and, upon analysis of the logs on their side, they found that there were problems on their end that prevented the request from going through. On my side the status code of the request signalled success, but the data returned was not the correct data.

The two functions that use the requests library work as expected and the issue can be considered solved.

JCPBBT
  • In your place I would consider using urllib instead of requests in this case. You just need to add appropriate headers for basic authentication and perhaps adjust "Content-Type"/"Accept", and read the raw data: e.g. u = urlopen(...); data = u.read(1024); while data: data = u.read(1024); ...; u.close(). Note as well that you can use pycurl if you are willing to switch from requests for some reason. – Dalen Mar 26 '20 at 12:02 (a urllib sketch along these lines appears after these comments)
  • @Dalen Thank you for the input. Could you please elaborate on why urllib is more apt than requests in performing this job? As for your comment on using pycurl, I will look at that as well as I am currently trying to set up different alternatives and test their performance in terms of download speed, so the more alternatives the better. – JCPBBT Mar 26 '20 at 15:19
  • The task you have in mind is simple and does not need anything more than Python's stdlib. urllib is simply sufficient. You do not go hunting butterflies with equipment made for hunting elephants. You need requests when you need sessions, advanced authentication, some crazy proxy stuff, some fancy stuff etc. Simple REST APIs are a job for urllib. – Dalen Mar 26 '20 at 22:23
  • And, I think you will find that requests is the slowest method because it has multiple layers. pycurl might be the fastest (it's written in C), besides manually opening a socket and talking to the server directly yourself. – Dalen Mar 26 '20 at 22:27
  • In the end, you should use that what works and is simplest, nicest, most reliable, dependency free (if possible), not an overkill, and feels best to you personally as a correct choice for the task in question. – Dalen Mar 26 '20 at 22:41
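
For reference, here is a minimal sketch of the urllib-based approach suggested in the comments, under the same hypothetical endpoint and credentials; it builds the Basic authentication header by hand and streams the response body to disk:

import base64
import shutil
from urllib.request import Request, urlopen

def urllib_download(download_url, file_name, api_username, api_password):
    # Basic authentication header built by hand, as suggested in the comments above.
    credentials = base64.b64encode(f'{api_username}:{api_password}'.encode()).decode()
    request = Request(download_url, headers={'Authorization': f'Basic {credentials}'})
    with urlopen(request) as response, open(file_name, 'wb') as f:
        # Stream the response body to the local file without loading it all into memory.
        shutil.copyfileobj(response, f)
    return file_name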

0 Answers