
I'm running Databricks on Microsoft Azure. I am copying all files from a Databricks DBFS path to a GCP/GCS bucket (Google Cloud Platform / Google Cloud Storage) using the Python libraries google-cloud-storage and google-cloud-core, which I installed from PyPI. I upload each file with the upload_from_filename method from google-cloud-storage. The source directory contains hundreds of files totalling more than 100 GB. The copy/upload succeeds, but it runs sequentially, one file at a time.

My question is: how can I get Databricks to parallelize the copy/upload operation, i.e. perform the uploads in multiple threads asynchronously?

The following is my code (the GCP bucket name and source file paths have been modified for clarity):

from google.cloud import storage
from datetime import datetime

storage_client = storage.Client()
bucket = storage_client.bucket('the-gcp-bucket')
files = dbutils.fs.ls('dbfs:/sourcefilepath/')
filenumber = 0

for fi in files:
  # dbfs:/ paths are accessible locally on the driver under /dbfs
  source_file_name = fi.path.replace("dbfs:", "/dbfs")
  destination_blob_name = 'TargetSubFolder/' + fi.name
  blob = bucket.blob(destination_blob_name)
  blob.upload_from_filename(source_file_name)
  filenumber = filenumber + 1
  print("File num: {} {} uploaded to {}.".format(str(filenumber), source_file_name, destination_blob_name))
  
print("File Copy Complete")
  • Does this answer your question? [Does gcloud storage python client API support parallel composite upload?](https://stackoverflow.com/questions/55249311/does-gcloud-storage-python-client-api-support-parallel-composite-upload) – Donnald Cucharo May 06 '21 at 02:44
  • This is also an open issue on GitHub: https://github.com/googleapis/python-storage/issues/36 – Donnald Cucharo May 06 '21 at 02:45
