
I want to read the last modified datetime of the files in a data lake from a Databricks script. If I could read it efficiently as a column when reading data from the data lake, that would be perfect.
Thank you:)


UPDATE: If you're working in Databricks, then since Databricks Runtime 10.4 (released on Mar 18, 2022) the dbutils.fs.ls() command returns the modificationTime of folders and files as well.
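A minimal sketch of that approach (assuming DBR 10.4+; the ABFSS path is a placeholder, and modificationTime is milliseconds since the Unix epoch):

from datetime import datetime, timezone

# On DBR 10.4+ each FileInfo returned by dbutils.fs.ls() carries modificationTime (epoch millis)
files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")
for f in files:
    modified = datetime.fromtimestamp(f.modificationTime / 1000, tz=timezone.utc)
    print(f.path, f.size, modified)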

ARCrow
  • will this help https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs/61423874#61423874 ? – Srinivas Jun 16 '21 at 15:50
  • @Srinivas Thank you for your comment. I'm limited to using pyspark, and dbutils.fs.ls, which gives out some metadata about files, doesn't contain the last modified datetime, only file size and path. Do you happen to know how I can replicate your logic in pyspark? – ARCrow Jun 16 '21 at 16:58
  • see the linked answer – Alex Ott Jun 17 '21 at 12:59

2 Answers


We can get those details using Python code, since there is no direct method to get the modified time and date of files in a data lake.

Here is the code:

# Uses the legacy azure-storage-blob SDK (pre-12.x), which provides BlockBlobService
from azure.storage.blob import BlockBlobService
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # Fetch the blob properties once and reuse them
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = properties.content_length
    last_modified = properties.last_modified
    line = '|'.join([container_name, second_container_name, blob.name,
                     str(file_size), str(last_modified), str(report_time)])
    print(line)

For more details, refer to the SO thread addressing a similar issue.
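Note that BlockBlobService comes from the legacy SDK; if you are on the current azure-storage-blob (v12+) package, the same listing can be done with BlobServiceClient. A minimal sketch under that assumption, with placeholder account values:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential="<account-key>")
container_client = service.get_container_client("<container-name>")

# list_blobs() already returns BlobProperties, so no extra per-blob call is needed
for blob in container_client.list_blobs(name_starts_with="Recovery/"):
    print(blob.name, blob.size, blob.last_modified)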

CHEEKATLAPRADEEP
SaiSakethGuduru
  • If you're working in Databricks, since Databricks runtime 10.4 released on Mar 18, 2022, dbutils.fs.ls() command returns “modificationTime” of the folders and files as well. – ARCrow Jul 05 '22 at 17:12

Regarding the issue, please refer to the following code:

# Access the Hadoop FileSystem API through the JVM gateway exposed by the SparkContext
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Authenticate against the storage account with an account access key
conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<account-access-key>")

# Resolve the FileSystem for the ABFSS path and list the files under it
path = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/')
fs = path.getFileSystem(conf)
status = fs.listStatus(path)

for file_status in status:
    print(file_status)
    # getModificationTime() returns milliseconds since the Unix epoch
    print(file_status.getModificationTime())
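If you need the listing as a Spark DataFrame (for example, to join it against the data you read), the statuses can be collected into rows. A minimal sketch, assuming the status list from above and an active spark session:

from pyspark.sql.functions import col

# FileStatus exposes the path, length (bytes) and modification time (epoch millis)
rows = [(s.getPath().toString(), s.getLen(), s.getModificationTime()) for s in status]
df = spark.createDataFrame(rows, ["path", "size", "modification_time_ms"])

# Convert the epoch milliseconds into a proper timestamp column
df = df.withColumn("last_modified", (col("modification_time_ms") / 1000).cast("timestamp"))
df.show(truncate=False)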


Alex Ott
Jim Xu
  • If you're working in Databricks, since Databricks runtime 10.4 released on Mar 18, 2022, dbutils.fs.ls() command returns “modificationTime” of the folders and files as well. – ARCrow Jul 05 '22 at 17:12