
I have an Azure Data Lake Storage Gen2 account that contains a few Parquet files. My organization has enabled credential passthrough, so I can create a Python script in Azure Databricks and access the files in ADLS using dbutils.fs.ls. All of this works fine.
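
For reference, listing the files works with something like this (the container and storage account names below are placeholders):

    # Listing files via credential passthrough works fine
    # (container/account names are placeholders)
    display(dbutils.fs.ls("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/"))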

Now I also need the last modified timestamps of these files. I found a link that shows how to do this, but it uses BlockBlobService and requires an account_key.

I do not have an account key and cannot get one due to my organization's security policies. I am unsure how to do the same using credential passthrough. Any ideas?

Sree Nair
  • Have you referred to https://learn.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough? – Jim Xu Jun 08 '20 at 03:40

2 Answers


You can try mounting the Azure Data Lake Storage Gen2 container with credential passthrough:

    # Use the credential passthrough token provider instead of an account key
    configs = {
      "fs.azure.account.auth.type": "CustomAccessToken",
      "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
    }
    mount_name = 'localmountname'
    container_name = 'containername'
    storage_account_name = 'datalakestoragename'
    dbutils.fs.mount(
      source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
      mount_point = f"/mnt/{mount_name}",
      extra_configs = configs)
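
Once the mount succeeds, the files should be visible through the mount point, e.g.:

    # List the mounted files (assumes the mount above succeeded)
    display(dbutils.fs.ls(f"/mnt/{mount_name}"))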
Axel R.
  • Hi Axel - thank you for your help. I am able to access my mounted drive, but only via dbutils.fs.ls(), which does not return the last modified timestamp, so I am stuck on the same problem. I still cannot find a way to read the last modified timestamp from the mounted drive via an ls command. Any ideas? – Sree Nair Jun 16 '20 at 06:40

You can do this using the Hadoop FileSystem object accessible via Spark:

    import time

    # Get a handle on the Hadoop FileSystem backing the ADLS Gen2 account
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = Path('abfss://container@storageaccount.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())

    # Recursively list all files under the given path
    res = fs.listFiles(Path('abfss://container@storageaccount.dfs.core.windows.net/path'), True)

    while res.hasNext():
      file = res.next()
      # getModificationTime() returns milliseconds since the epoch
      localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
      print(f"{file.getPath()}: {localTime}")

Note that the True parameter in the listFiles() method means recursive.
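
If you only need the timestamp of a single known file, getFileStatus() should work as well (the file path below is a placeholder):

    # FileStatus also exposes getModificationTime() in milliseconds
    status = fs.getFileStatus(Path('abfss://container@storageaccount.dfs.core.windows.net/path/file.parquet'))
    local_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(status.getModificationTime() / 1000))
    print(f"{status.getPath()}: {local_time}")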