
In my Python script I am determining the size of a directory in Azure Data Lake Storage Gen2. The code works fine until I run it on a larger directory.

import sys
from dbutils import FileInfo
from typing import List

sys.setrecursionlimit(2000)
root_path = "/mnt/datalake/.../"

def discover_size(path: str, verbose: bool = True):
  def loop_path(paths: List[FileInfo], accum_size: float):
    if not paths:
      return accum_size
    else:
      head, tail = paths[0], paths[1:]
      if head.size > 0:
        if verbose:
          accum_size += head.size / 1e6
        return loop_path(tail, accum_size)
      else:
        extended_tail = dbutils.fs.ls(head.path) + tail
        return loop_path(extended_tail, accum_size)

  return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True) 

At first I hit an OOM (out of memory) error, so I added

sys.setrecursionlimit(2000).

Now I get a different error:

RecursionError: maximum recursion depth exceeded in comparison

How can I overcome this issue?

Crime_Master_GoGo

1 Answer


The docs for dbutils.fs.ls() are far from perfect and I don't have a Databricks environment at hand, but something like this should work better. Instead of real recursion, it keeps a list of paths left to visit.

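# Assumes a Databricks notebook, where dbutils is already available as a global.


def discover_size(path: str, verbose: bool = True) -> int:
    """Return the total size, in bytes, of everything under path."""
    total_size = 0
    visited = set()
    to_visit = [path]  # explicit work list instead of the call stack
    while to_visit:
        path = to_visit.pop(0)
        if path in visited:
            if verbose:
                print("Already visited %s..." % path)
            continue
        visited.add(path)
        if verbose:
            print("Visiting %s, size %s so far..." % (path, total_size))
        for info in dbutils.fs.ls(path):
            total_size += info.size
            if info.isDir():
                to_visit.append(info.path)  # lists have append(), not add()
    return total_size


discover_size("/mnt/datalake/.../", verbose=True)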
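
For trying the loop outside Databricks, here is a minimal sketch of the same work-list idea run against an in-memory listing. CPython does not eliminate tail calls, so the recursive version in the question needs one stack frame per listed entry, while a loop like this needs none. FakeFileInfo, make_deep_tree and the ls parameter are hypothetical stand-ins invented for this example; they are not part of dbutils.

```python
from collections import deque
from typing import Callable


class FakeFileInfo:
    """Hypothetical stand-in for the entries returned by dbutils.fs.ls()."""

    def __init__(self, path: str, size: int, is_dir: bool):
        self.path = path
        self.size = size
        self.is_dir = is_dir

    def isDir(self) -> bool:
        return self.is_dir


def discover_size(root: str, ls: Callable[[str], list]) -> int:
    """Sum file sizes under root using a queue of pending directories."""
    total_size = 0
    visited = set()
    to_visit = deque([root])
    while to_visit:
        path = to_visit.popleft()
        if path in visited:
            continue
        visited.add(path)
        for info in ls(path):
            if info.isDir():
                to_visit.append(info.path)  # list this directory later
            else:
                total_size += info.size
    return total_size


def make_deep_tree(depth: int) -> dict:
    """Fake listing: a chain of `depth` directories, each with one 10-byte file."""
    tree = {}
    for level in range(depth):
        entries = [FakeFileInfo("/dir%d/data.bin" % level, 10, False)]
        if level + 1 < depth:
            entries.append(FakeFileInfo("/dir%d/" % (level + 1), 0, True))
        tree["/dir%d/" % level] = entries
    return tree


# 5000 chained directories is well past the default recursion limit of 1000.
tree = make_deep_tree(5000)
print(discover_size("/dir0/", tree.__getitem__))  # 5000 files * 10 bytes = 50000
```

Passing the listing function in as an argument keeps the traversal testable; in a notebook one would presumably pass dbutils.fs.ls instead.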

AKX
  • This gives a syntax error at print(f"Already visited {path}...") because of the "f". If I remove that line, I get no output. – Crime_Master_GoGo May 19 '20 at 12:46
  • That syntax error means you're using an older version of Python. Anyway, I've updated the code for compatibility. – AKX May 19 '20 at 13:20
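
For reference, a small sketch of the formatting styles discussed in the comments above. f-strings need Python 3.6 or later, whereas the % operator and str.format also work on older versions. The path and size values are made up for illustration.

```python
path = "/mnt/example/"  # hypothetical path, for illustration only
total_size = 1024

# Portable: printf-style formatting with the % operator.
msg_percent = "Visiting %s, size %s so far..." % (path, total_size)

# Portable: str.format().
msg_format = "Visiting {}, size {} so far...".format(path, total_size)

# Both styles produce the same string.
assert msg_percent == msg_format
print(msg_percent)
```

Mixing the two, as in f"...%s..." % value, runs on 3.6+ but still raises a SyntaxError on older interpreters because of the f prefix.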