
I'm using PySpark to try to read a zip file from blob storage. I want to unzip the file once it is loaded, and then write the unzipped CSVs back to blob storage.

I'm following this guidance, which explains how to unzip the file once it has been read: https://docs.databricks.com/_static/notebooks/zip-files-python.html

But it doesn't explain how to read the zip from blob storage in the first place. I have the following code:

file_location = "path_to_my.zip"
df = sqlContext.read.format("file_location").load

I expected this to load the zip into Databricks as df, and from there I could follow the advice in the article to unzip it, load the CSVs into a dataframe, and then write the dataframes back to blob storage.

Any ideas on how to initially read the zip file from blob storage using PySpark?

Thanks,

Mrmoleje
  • Were you able to resolve the issue? I have similar post [here](https://stackoverflow.com/q/72891590/1232087) if you want to share your thoughts and/or a suggestion. – nam Jul 07 '22 at 23:11

1 Answer


As shown in the first cell of the Databricks notebook you linked, you need to download the zip file and decompress it somehow. Your case is different because you are using Azure Blob Storage and you want to do everything in Python (without calling out to another shell application).

The Azure Blob Storage quickstart for Python documents the process for accessing files in Azure Blob Storage. You need to follow these steps:

  1. Install the package azure-storage-blob.
  2. Import the SDK modules and set the necessary credentials (reference).
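For example, a minimal sketch (the environment variable name AZURE_STORAGE_CONNECTION_STRING follows the Azure quickstart; any secure credential source, such as Databricks secrets, works just as well):

import os
from azure.storage.blob import BlobServiceClient

# Assumption: the connection string for the storage account is
# available as an environment variable
connect_str = os.getenv("AZURE_STORAGE_CONNECTION_STRING")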
  3. Create an instance of BlobServiceClient using a connection string:
# Create the BlobServiceClient object which will be used to create a container client
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
  4. Create an instance of BlobClient for the file you want:
blob_client = blob_service_client.get_blob_client(container="container", blob="path_to_my.zip")
  5. Download the blob (the zip file) and unzip it. Note that the file is a .zip archive rather than a .gz file, so Python's zipfile module is the right tool (gzip only handles gzip streams). I would write something like this:
import io
import zipfile

# Download the whole blob into memory and extract every file in the
# archive (the CSVs) to a local directory
zip_bytes = blob_client.download_blob().readall()
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    archive.extractall("./my/local/")
  6. Use the extracted CSVs under "./my/local/" to create the DataFrame, as sketched below.
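To round things off, creating the DataFrame and writing it back to blob storage could look like this sketch (the file name extracted_file.csv and the mount point /mnt/mycontainer are assumptions for illustration; substitute your own):

import os

# Read one of the extracted CSVs into a Spark DataFrame;
# the "file:" scheme points Spark at the driver's local filesystem
local_csv = "file:" + os.path.abspath("./my/local/extracted_file.csv")
df = spark.read.csv(local_csv, header=True, inferSchema=True)

# Write the DataFrame back to blob storage
# (assumes the container is mounted at /mnt/mycontainer)
df.write.mode("overwrite").csv("/mnt/mycontainer/unzipped/")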
boechat107