I am currently developing a Python script that calls a REST API to download data that is made available every day through the API. The files that I am trying to download have a .txt.bz2 extension.
The API documentation recommends using curl to download data from the API. In particular, the recommended download command is:
curl --user Username:Password https://api.endpoint.com/data/path/to/file -o my_filename.txt.bz2
The URL of the API data endpoint shown here is, of course, fictitious.
Since the documentation recommends curl, my current implementation of the Python script uses the subprocess library to call curl from within Python:
import subprocess

def data_downloader(download_url, file_name, api_username, api_password):
    # Build the curl command recommended by the API documentation
    args = ['curl', '--user', f'{api_username}:{api_password}', download_url, '-o', file_name]
    subprocess.call(args)
    return file_name
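One limitation of the function above is that subprocess.call only returns curl's exit code, which the function then ignores. The following is a minimal sketch of a stricter variant, assuming the installed curl supports the standard `--fail`, `--silent` and `--show-error` options. The names `build_curl_args` and `checked_data_downloader` are hypothetical and do not come from the API documentation.

```python
import subprocess


def build_curl_args(download_url, file_name, api_username, api_password):
    """Build the curl argument list (hypothetical helper for illustration)."""
    return [
        'curl',
        '--fail',        # exit with a non-zero code when the server returns HTTP 400 or above
        '--silent',      # hide the progress meter
        '--show-error',  # but still print error messages
        '--user', f'{api_username}:{api_password}',
        download_url,
        '-o', file_name,
    ]


def checked_data_downloader(download_url, file_name, api_username, api_password):
    """Run curl and raise subprocess.CalledProcessError if it exits non-zero."""
    args = build_curl_args(download_url, file_name, api_username, api_password)
    subprocess.run(args, check=True)
    return file_name
```

This only catches failures that curl itself reports. A response that carries a success status but the wrong body would still pass.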
However, I use the requests library extensively in other parts of the application, mainly to send requests to the API and to walk its file-system-like structure, so I have tried to implement the download function with this library as well. Using another Stack Overflow thread as a reference, I wrote the following two functions with the requests library:
import requests
import shutil

def download_file(download_url, file_name, api_username, api_password, chunk_size):
    # Stream the response and write it to disk chunk by chunk
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name

def shutil_download(download_url, file_name, api_username, api_password):
    # Copy the raw response stream straight into the open file object
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return file_name
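Neither function checks how much data actually reached the disk. As a rough sketch, the chunk-writing loop could count bytes and compare the total with an expected size. The helper name `save_chunks` and its `expected_size` parameter are hypothetical. The comparison is only meaningful if the server sends a Content-Length header and does not apply Content-Encoding compression, which is an assumption about the API.

```python
def save_chunks(chunks, file_name, expected_size=None):
    """Write an iterable of byte chunks to file_name and return the byte count.

    Raises IOError if expected_size is given and does not match the
    number of bytes written.
    """
    written = 0
    with open(file_name, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
    if expected_size is not None and written != expected_size:
        raise IOError(
            f'{file_name}: expected {expected_size} bytes, wrote {written}'
        )
    return written
```

Inside `download_file`, this could be called as `save_chunks(r.iter_content(chunk_size=chunk_size), file_name, expected_size)`, with `expected_size` taken from `int(r.headers['Content-Length'])` when that header is present. Calling `r.raise_for_status()` first turns HTTP error codes into exceptions.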
With the subprocess implementation I can download the entire file without any issue. With either of the two requests implementations, however, I always end up with a downloaded file of about 1 KB, which is clearly wrong since most of the files that I am downloading are larger than 10 GB.
I suspect that the issue is caused by the format of the data that I am attempting to download, as I have seen .zip or .gzip files downloaded successfully with the same logic that I am using in the two functions. Hence I am wondering whether anyone can explain the issue or provide a working solution.
UPDATE
I had a chance to discuss the issue with the owner of the API. After analysing the logs on their side, they found problems on their end that prevented the request from going through. On my side, the status code signalled a successful request, but the returned data was not the correct data.
The two functions that use the requests library work as expected, and the issue can be considered solved.
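Because the status code alone did not reveal the problem, a small sanity check on the downloaded file can help catch this kind of situation early. The sketch below uses only the standard-library bz2 module. The helper name `looks_like_bz2` and the `probe_size` default are hypothetical choices. It only inspects the beginning of the file, so it shows that the data starts as a valid bzip2 stream, not that the whole file is intact.

```python
import bz2

BZ2_MAGIC = b'BZh'  # every bzip2 stream starts with these three bytes


def looks_like_bz2(file_name, probe_size=64 * 1024):
    """Return True if the start of file_name decompresses as bzip2 data."""
    with open(file_name, 'rb') as f:
        head = f.read(probe_size)
    if not head.startswith(BZ2_MAGIC):
        return False
    try:
        # Feed only the first bytes: truncated input is accepted by the
        # incremental decompressor, while invalid data raises OSError.
        bz2.BZ2Decompressor().decompress(head)
    except OSError:
        return False
    return True
```

A 1 KB file containing an error message instead of compressed data would fail this check, since it would not start with the bzip2 magic bytes.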