I have the following Python code that runs in a Jupyter notebook. It downloads a tar file from a source location, extracts it, and uploads the extracted files to Azure Blob Storage.
import os
import tarfile

from azure.storage.blob import BlobClient

def upload_folder(local_path):
    connection_string = "XXX"
    container_name = "mycontainername"
    # Walk the archive and upload each member as a separate blob
    with tarfile.open(local_path, "r") as file:
        for each in file.getnames():
            print(each)
            # Extract this member into the current working directory
            file.extract(each)
            blob = BlobClient.from_connection_string(
                connection_string,
                container_name=container_name,
                blob_name=each,
            )
            # Upload the extracted file, then delete the local copy
            with open(each, "rb") as f:
                blob.upload_blob(f, overwrite=True)
            os.remove(each)
# MAIN
!wget https://path/to/myarchive.tar.gz
local_path = "myarchive.tar.gz"
upload_folder(local_path)
!rm -rf myarchive.tar.gz
!rm -rf myarchive
The myarchive.tar.gz file is about 1 GB, which corresponds to approximately 4 GB of uncompressed data. The problem is that this code takes far too long to run even for such a relatively small data volume: around 5-6 hours.
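I have not profiled this in detail yet; a quick check would be to time the extraction on its own, to see whether the bottleneck is the tar handling or the blob uploads (a minimal, untested sketch, reusing the archive name from above):

import time
import tarfile

# Time a one-pass extraction separately from any uploading
t0 = time.perf_counter()
with tarfile.open("myarchive.tar.gz", "r") as file:
    file.extractall()
print(f"extraction took {time.perf_counter() - t0:.1f} s")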
What am I doing wrong? Is there any way to optimise my code so that it runs faster?
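One idea I have been considering, but have not tested, is to extract everything in a single pass and then upload the files in parallel with a thread pool. This is a rough sketch under my own assumptions: the worker count of 8 is an arbitrary guess, and a ContainerClient is created per task just to sidestep any thread-safety questions.

import os
import tarfile
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

connection_string = "XXX"
container_name = "mycontainername"

def upload_one(path):
    # A fresh client per upload keeps the sketch simple; it could be shared
    container = ContainerClient.from_connection_string(connection_string, container_name)
    with open(path, "rb") as f:
        container.upload_blob(name=path, data=f, overwrite=True)

with tarfile.open("myarchive.tar.gz", "r") as archive:
    archive.extractall()  # one pass instead of per-member extract()
    # Skip directory entries; only regular files are uploaded
    names = [n for n in archive.getnames() if os.path.isfile(n)]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_one, names))

Would something like this be the right direction, or is there a better approach?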