
I'd like to use the Python bindings to delta-rs to read from my blob storage.

Currently I am a bit lost: I cannot figure out how to configure the filesystem on my local machine. Where do I have to put my credentials?

Can I use adlfs for this?

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="...",
)

and then use the fs object?

Hongbo Miao
user7454972
4 Answers


Unfortunately we don't have great documentation around this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables, as in this integration test.

That will let the Python bindings access the table metadata, but the data for a query is typically fetched through Pandas, and I'm not sure whether Pandas picks up these variables as well (I'm not an ADLSv2 user myself).
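
A minimal sketch of that approach, assuming a SAS token and an abfss-style table URI (the URI scheme and exact credential handling may vary by delta-rs version; all values are placeholders):

import os
from deltalake import DeltaTable

# Credentials as environment variables, as suggested above (values are placeholders)
os.environ["AZURE_STORAGE_ACCOUNT"] = "your-storage-account"
os.environ["AZURE_STORAGE_SAS"] = "your-sas-token"

# The abfss-style URI is an assumption; adjust it to your container and table path
dt = DeltaTable("abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/table")
print(dt.files())  # lists the data files referenced by the table metadata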

rtyler

One possible workaround is to download the Delta Lake files to a temporary directory and read them with python-delta-rs, something like this:

import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable

def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        if "." in blob.name:
            # To just get files and not directories, there might be a better way to do this
            blob_names.append(blob.name)

    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)

        with open(local_filename, 'wb') as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")

    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
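
A hypothetical call to the helper above (the table path and access key are placeholders):

df = read_delta_lake_file_to_df("path/to/delta-table", access_key="your-access-key")
print(df.head())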



I don't know about delta-rs, but you can use this fs object directly with pandas:

import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="account_name", account_key="access_key", container_name="name_of_container")
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)

You can also use the storage_options argument, e.g.:

import deltalake as dl

delta_url = f"{protocol}://{container_name}@{storage_account_name}.dfs.core.windows.net/{delta_path}"

# Pass the access key as a storage option (it can also be set via an environment variable)
storage_options = {"ACCESS_KEY": f"{access_key}"}

# Read the Delta table from the storage account
df = dl.DeltaTable(delta_url, storage_options=storage_options).to_pyarrow_table()

The available options are described here.
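
If you want a pandas DataFrame rather than an Arrow table, the result above can be converted, e.g.:

# df above is a pyarrow.Table; convert it to pandas
pandas_df = df.to_pandas()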

Will W