10

I'm trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I'm testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I'm working in Azure Databricks. I'm open to using Scala to do the job. So far, nothing has worked for me. Each time, I keep getting an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?

ASH
  • Shouldn't it be `os.walk(path)` instead of `root`? – furas Nov 07 '19 at 14:58
  • Maybe first check if this folder really exists in the system. Maybe it is not a folder but a file: `os.path.exists(path)`, `os.path.isfile(path)`, `os.path.isdir(path)`. – furas Nov 07 '19 at 15:03
  • Or maybe the system mounts it only when it needs it, and it doesn't know that you need it? Or maybe it reads it from a database? – furas Nov 07 '19 at 15:06
  • I tried your suggestions. I'm getting the same thing...an empty dataframe. This is so bizarre. This code, or a very similar version of it, worked fine last week. Something changed, but I'm not sure what. – ASH Nov 07 '19 at 15:09
  • First use any other program to check if the folder exists, if it has the same name, and if there are files. Maybe it is empty or its name changed. – furas Nov 07 '19 at 15:25
  • The data exists. I can see everything in Storage Explorer. I can load data from the lake into tables using Databricks. I just can't get the files listed out...for some odd reason... – ASH Nov 07 '19 at 16:53
  • Maybe Storage Explorer doesn't show files on disk but items in a database? The same could be true of Databricks. Maybe it reads from different storage than the disk. – furas Nov 07 '19 at 17:06
  • Updated my answer: the reference to the Databricks filesystem is missing; you need this if you are using the local file APIs. – Hauke Mallow Nov 08 '19 at 07:27
  • With Gen2+ you can use the Python API to trawl your whole FS with a simple recursive call, but with 6.0 on a Gen1 you have to use dbutils! It was a massive headache on a project I was working on; I had to create some lengthy functions to get around it. – Umar.H Jan 10 '20 at 10:51

3 Answers

20

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. If you are using the local file APIs, you have to reference the Databricks filesystem. Azure Databricks configures each cluster node with a FUSE mount /dbfs that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs (see also the documentation).

So /dbfs has to be included in the path:

root = "/dbfs/mnt/rawdata/parent/"

That is different from working with the Databricks File System utilities (DBUtils). The file system utilities access the Databricks File System, making it easier to use Azure Databricks as a file system:

dbutils.fs.ls("/mnt/rawdata/parent/")
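
The entries returned by dbutils.fs.ls are FileInfo objects with path, name and size attributes, so a flat (non-recursive) inventory of a single directory can be collected into a pandas DataFrame roughly like this. It is only a sketch that reuses the mount path from the question and assumes it runs in a Databricks notebook where dbutils is available:

import pandas as pd

# list one directory with the Databricks utilities (non-recursive)
files = dbutils.fs.ls("/mnt/rawdata/parent/")

# each entry is a FileInfo with .path, .name and .size
df = pd.DataFrame([(f.path, f.name, f.size) for f in files],
                  columns=["path", "name", "size"])
print(df)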

For larger data lakes I can recommend the Scala example in the Knowledge Base. The advantage is that it runs the listing for all child leaves in a distributed fashion, so it will also work for bigger directories.

Hauke Mallow
  • I don't understand why, but for me, when using Scala + java.io, I had to include the /dbfs prefix. When using `dbutils.fs.ls` I did not. – Nick.Mc Jul 03 '20 at 08:14
  • The reason might be that you don't access data via a mount point path, which is what is done in the examples above. Data written to mount point paths (/mnt) is stored outside of the DBFS root. For a DBFS root path you have to use dbfs:/. – Hauke Mallow Jul 03 '20 at 15:53
  • Works perfectly for `abfss://` as well (Azure Blob File System). – DaReal Feb 02 '22 at 01:12
2

I got this to work.

from azure.storage.blob import BlockBlobService 

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

blobs = []
marker = None
while True:
    # list_blobs returns one page of results at a time; follow the continuation marker
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)

The only prerequisite is that the azure.storage package is installed on the cluster so it can be imported. In the Clusters window, click 'Install New' -> PyPI, enter 'azure.storage' as the package, and click 'Install'.
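
If you only need the files under one parent folder, the same legacy SDK call also accepts a prefix argument that restricts the listing to that virtual folder. Here is a sketch that reuses the paging loop from above with the question's 'parent/' folder (container and folder names are placeholders) and loads the result into pandas:

import pandas as pd
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

# same paging loop as above, restricted to one virtual folder via `prefix`
names, marker = [], None
while True:
    batch = blob_service.list_blobs('rawdata', prefix='parent/', marker=marker)
    names.extend(blob.name for blob in batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

df = pd.DataFrame(names, columns=['blob_name'])
print(df)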

ASH
1

I wrote this and it works for me. It uses the `dbutils.fs.ls` technique at its heart and adds a recursive element to traverse subdirectories.

You just have to specify the root directory, and it will return the paths to all the ".parquet" files it finds.

#------
# find parquet files in subdirectories recursively
def find_parquets(dbfs_ls_list):
    parquet_list = []
    if isinstance(dbfs_ls_list, str):
        # allows the caller to start the recursion with just a path string
        dbfs_ls_list = dbutils.fs.ls(dbfs_ls_list)
        parquet_list += find_parquets(dbfs_ls_list)
    else:
        for file_data in dbfs_ls_list:
            if file_data.size == 0 and file_data.name[-1] == '/':
                # found a subdirectory -- recurse into it
                new_dbfs_ls_list = dbutils.fs.ls(file_data.path)
                parquet_list += find_parquets(new_dbfs_ls_list)
            elif '.parquet' in file_data.name:
                parquet_list.append(file_data.path)
    return parquet_list

#------
root_dir = 'dbfs:/FileStore/my/parent/folder/'
file_list = find_parquets(root_dir)
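
The same traversal can also be written iteratively with an explicit stack, which avoids hitting Python's recursion limit on very deep folder trees. This is only a sketch built on the same dbutils.fs.ls call, using the hypothetical root path from above:

# iterative variant: walk the tree with an explicit stack instead of recursion
def find_parquets_iterative(root_dir):
    parquet_list = []
    dirs_to_visit = [root_dir]
    while dirs_to_visit:
        current_dir = dirs_to_visit.pop()
        for file_data in dbutils.fs.ls(current_dir):
            if file_data.size == 0 and file_data.name.endswith('/'):
                # subdirectory -- queue it up for a later pass
                dirs_to_visit.append(file_data.path)
            elif file_data.name.endswith('.parquet'):
                parquet_list.append(file_data.path)
    return parquet_list

file_list = find_parquets_iterative('dbfs:/FileStore/my/parent/folder/')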