
I have the following Python code that runs in a Jupyter Notebook. It downloads a tar file from a source location, untars it, and uploads the extracted files to Azure Blob Storage.

import os
import tarfile
from azure.storage.blob import BlobClient

def upload_folder(local_path):
    connection_string = "XXX"
    container_name = "mycontainername"

    with tarfile.open(local_path, "r") as file:
        for each in file.getnames():
            print(each)
            # extract one member at a time from the archive
            file.extract(each)
            # create a new BlobClient for this file and upload it
            blob = BlobClient.from_connection_string(connection_string,
                                                     container_name=container_name,
                                                     blob_name=each)

            with open(each, "rb") as f:
                blob.upload_blob(f, overwrite=True)
            # delete the local copy once uploaded
            os.remove(each)


# MAIN
!wget https://path/to/myarchive.tar.gz

local_path = "myarchive.tar.gz"

upload_folder(local_path)

!rm -rf myarchive.tar.gz
!rm -rf myarchive

The myarchive.tar.gz file is about 1 GB, which corresponds to approximately 4 GB of uncompressed data. The problem is that this code takes far too long to run for such a relatively small data volume: around 5-6 hours.

What am I doing wrong? Is there any way to optimise my code to run it faster?

Fluxy
  • It looks like you're extracting each file individually from the tar, then starting over on the next file. That won't be efficient. There has to be a way to get them all in one pass, but I'm not familiar with `tarfile` so I can't answer. – Mark Ransom Jan 11 '21 at 04:21
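
As the comment suggests, `tarfile` can unpack every member in a single pass; a minimal sketch using the archive name from the question:

import tarfile

with tarfile.open("myarchive.tar.gz", "r") as archive:
    # extractall unpacks all members in one pass instead of seeking
    # back into the compressed stream for every individual file
    archive.extractall()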

1 Answer


You can treat the upload of each file as an independent task and use multiprocessing to create a process pool. Running several of those tasks at the same time through the pool will speed up the script. For more details, please refer to here and here.
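
A minimal sketch of that approach, reusing the placeholder connection string, container name, and archive name from the question; the pool size of 8 is an arbitrary choice:

import os
import tarfile
from multiprocessing import Pool
from azure.storage.blob import BlobClient

CONNECTION_STRING = "XXX"           # same placeholder as in the question
CONTAINER_NAME = "mycontainername"

def upload_one(name):
    # each worker creates its own BlobClient and uploads a single extracted file
    blob = BlobClient.from_connection_string(CONNECTION_STRING,
                                             container_name=CONTAINER_NAME,
                                             blob_name=name)
    with open(name, "rb") as f:
        blob.upload_blob(f, overwrite=True)
    os.remove(name)

def upload_folder(local_path):
    # extract the whole archive in one pass instead of member by member
    with tarfile.open(local_path, "r") as archive:
        archive.extractall()
        names = [m.name for m in archive.getmembers() if m.isfile()]

    # upload the extracted files in parallel with a process pool
    with Pool(processes=8) as pool:
        pool.map(upload_one, names)

upload_folder("myarchive.tar.gz")

Since the work is mostly network I/O, a `concurrent.futures.ThreadPoolExecutor` would work just as well, and it avoids multiprocessing's requirement that worker functions be importable by the spawned processes, which can be an issue for functions defined directly in a Jupyter notebook.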

Jim Xu