
Using Python/dbutils, how do I display the files of the current directory and its subdirectories recursively in the Databricks File System (DBFS)?

Kiran A

4 Answers


A surprising thing about dbutils.fs.ls (and the %fs magic command) is that it doesn't seem to support a recursive switch. However, since the ls function returns a list of FileInfo objects, it is quite trivial to iterate over them recursively to get the whole contents, e.g.:

def get_dir_content(ls_path):
  # List the immediate contents, then recurse into every subdirectory.
  dir_paths = dbutils.fs.ls(ls_path)
  # The p.path != ls_path check avoids infinite recursion if ls returns
  # the listed path itself (e.g. when ls_path points at a file).
  subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
  flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
  return [p.path for p in dir_paths] + flat_subdir_paths


paths = get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13')
for p in paths:
  print(p)
Daniel

An alternative implementation uses generators and the yield operator. You need at least Python 3.3 for the yield from operator; check out this great post for a better understanding of yield:

def get_dir_content(ls_path):
    for dir_path in dbutils.fs.ls(ls_path):
        if dir_path.isFile():
            yield dir_path.path
        elif dir_path.isDir() and ls_path != dir_path.path:
            # Recurse into subdirectories; the path check avoids infinite
            # recursion if ls returns the listed path itself.
            yield from get_dir_content(dir_path.path)


list(get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13'))
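
Because this version is a generator, you can also consume the paths lazily instead of materializing the full list, e.g.:

# Print paths as they are produced, without building the whole list first.
for p in get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13'):
    print(p)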
choeh

You could also try this recursive function:

def lsR(path):
    # For each entry returned by dbutils.fs.ls: a file contributes a
    # one-element list, a directory contributes its recursive listing.
    # The outer comprehension then flattens the resulting list of lists.
    return [
        fname
        for flist in [
            ([fi.path] if fi.isFile() else lsR(fi.path))
            for fi in dbutils.fs.ls(path)
        ]
        for fname in flist
    ]


lsR("/your/folder")
Gaurav Gandhi
  • The answer would be more useful if you could add some explanation along with the code. – holydragon Nov 11 '21 at 10:34
  • I will gladly do it, but I'm not sure what kind of explanation you would expect. lsR() should return a list of file names, so: 1. The inner part, [([fi.path] if fi.isFile() else lsR(fi.path)) for fi in dbutils.fs.ls(path)], builds a list of lists: for each result of dbutils.fs.ls, if fi is a file it adds a list with only one item, and if fi is a directory it calls lsR() recursively to get that directory's list of file names. 2. That list of lists is then "unpacked" by the double comprehension [fname for flist in ... for fname in flist], which turns [['a'], ['b'], ['c', 'd', 'e']] into ['a', 'b', 'c', 'd', 'e']. – Marcin Skotis Nov 15 '21 at 23:01
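
As a standalone illustration of the flattening step Marcin describes (using the example data from his comment):

# The double comprehension flattens a list of lists into a single list.
nested = [['a'], ['b'], ['c', 'd', 'e']]
flat = [fname for flist in nested for fname in flist]
print(flat)  # ['a', 'b', 'c', 'd', 'e']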

There are other answers listed here, but it is worth noting that Databricks stores datasets as folders.

For example, you might have a 'directory' called my_dataset_here, which contains files like this:

my_dataset_here/part-00193-111-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-123-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-222-c845-4ce6-8714-123-c000.snappy.parquet
my_dataset_here/part-00193-444-c845-4ce6-8714-123-c000.snappy.parquet
...

There will be thousands of such files in a typical set of tables.
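
For instance, a quick way to see this for yourself (the dataset path below is hypothetical) is to count the part files in a single dataset folder:

# Count the part files directly under one dataset folder (hypothetical path).
entries = dbutils.fs.ls("/path/to/my_dataset_here")
part_files = [f for f in entries if f.name.startswith("part-")]
print(len(part_files))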

Attempting to enumerate every single file in such a folder can take a very long time (minutes, even), because the single call to dbutils.fs.ls must return an array of every single result.

Therefore, a naive approach such as:

stack = ["/databricks-datasets/COVID/CORD-19/2020-03-13"]
while len(stack) > 0:
  current_folder = stack.pop(0)  # pop(0) makes this a breadth-first walk
  for file in dbutils.fs.ls(current_folder):
    print(file.path)
    if file.isDir():
      stack.append(file.path)  # descend into every subdirectory

will indeed list every file, but it will also take forever to finish. In my test environment, enumerating 50-odd tables took 8 minutes.
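
If you want to reproduce this measurement on your own workspace, a minimal sketch might look like the following (walk_naive is a hypothetical helper, just the loop above wrapped in a function that collects paths instead of printing them):

import time

def walk_naive(root):
    # Same breadth-first walk as above, returning paths instead of printing.
    queue, paths = [root], []
    while queue:
        current_folder = queue.pop(0)
        for file in dbutils.fs.ls(current_folder):
            paths.append(file.path)
            if file.isDir():
                queue.append(file.path)
    return paths

start = time.perf_counter()
paths = walk_naive("/databricks-datasets/COVID/CORD-19/2020-03-13")
print(f"{len(paths)} paths in {time.perf_counter() - start:.1f}s")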

However, the newer 'delta' format, if used, creates a standard folder named '_delta_log' inside each delta table's folder.

We can therefore modify our code to check each folder to see if it is a dataset before attempting to enumerate the entire contents of the folder:

stack = ["/databricks-datasets/COVID/CORD-19/2020-03-13"]
while len(stack) > 0:
  current_folder = stack.pop(0)
  for file in dbutils.fs.ls(current_folder):
    if file.isDir():
      # Check if this is a delta table and do not recurse if so!
      try:
        delta_check_path = f"{file.path}/_delta_log"
        dbutils.fs.ls(delta_check_path)  # raises an exception if missing
        print(f"dataset: {file.path}")
      except Exception:
        stack.append(file.path)
        print(f"folder: {file.path}")
    else:
      print(f"file: {file.path}")

This code runs on the same test environment in 38 seconds.

In trivial situations the naive solution is acceptable, but it quickly becomes totally unacceptable in real-world situations.

Notice that this code will only work on delta tables; if you are using parquet/CSV/some other format, you're out of luck.
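
One possible workaround for non-delta formats (not part of the answer above, just a sketch under the assumption that such datasets are folders whose direct children are part files) would be a heuristic check instead of the _delta_log probe:

def looks_like_dataset(folder_path):
    # Heuristic, not authoritative: treat a folder as a dataset if any
    # direct child is a part file. Adjust the prefix for your layout.
    try:
        return any(f.name.startswith("part-") for f in dbutils.fs.ls(folder_path))
    except Exception:
        return False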

Doug