I need to process all files in a directory tree recursively, but with a limited depth.

For example, I want to look for files in the current directory and the first two subdirectory levels, but no further. In that case, I must process e.g. ./subdir1/subdir2/file, but not ./subdir1/subdir2/subdir3/file.

How would I do this best in Python 3?

Currently I use os.walk to process all files up to infinite depth in a loop like this:

for root, dirnames, filenames in os.walk(args.directory):
    for filename in filenames:
        path = os.path.join(root, filename)
        # do something with that file...

One way I can think of is to count the directory separators (/) in root to determine the current file's hierarchical level, and skip anything whose level exceeds the desired maximum.

I suspect this approach is fragile and probably quite inefficient when there is a large number of subdirectories to ignore. What would be the optimal approach here?
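
For reference, a minimal sketch of the separator-counting idea. It relies on the documented os.walk behaviour that, with the default topdown=True, removing entries from dirnames in place stops os.walk from descending into them, so ignored subtrees are never visited. The function name files_up_to_depth and its max_depth parameter are hypothetical, chosen only for illustration.

```python
import os


def files_up_to_depth(top, max_depth):
    """Return paths of files under `top`, at most `max_depth` levels down.

    Hypothetical helper: max_depth=0 means only `top` itself,
    max_depth=2 means `top` plus two subdirectory levels.
    """
    paths = []
    for root, dirnames, filenames in os.walk(top):
        # Depth of `root` relative to `top`, derived from separator count.
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            # Prune in place so os.walk does not descend any further.
            del dirnames[:]
        for filename in filenames:
            paths.append(os.path.join(root, filename))
    return paths
```

Calling files_up_to_depth(".", 2) would then collect ./subdir1/subdir2/file but not ./subdir1/subdir2/subdir3/file.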

Georgy
Byte Commander
  • Related: [List all subdirectories on given level](https://stackoverflow.com/q/16810686/7851470) – Georgy Oct 05 '19 at 09:48

2 Answers


I think the easiest and most stable approach would be to copy the functionality of os.walk straight out of the source and insert your own depth-controlling parameter.

import os
import os.path as path

def walk(top, topdown=True, onerror=None, followlinks=False, maxdepth=None):
    islink, join, isdir = path.islink, path.join, path.isdir

    try:
        names = os.listdir(top)
    except OSError as err:  # Python 3 syntax; "except OSError, err" is Python 2 only
        if onerror is not None:
            onerror(err)
        return

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs

    if maxdepth is None or maxdepth > 1:
        for name in dirs:
            new_path = join(top, name)
            if followlinks or not islink(new_path):
                for x in walk(new_path, topdown, onerror, followlinks, None if maxdepth is None else maxdepth-1):
                    yield x
    if not topdown:
        yield top, dirs, nondirs

for root, dirnames, filenames in walk(args.directory, maxdepth=2):
    pass  # process root, dirnames and filenames here

If you're not interested in all those optional parameters, you can pare down the function pretty substantially:

import os

def walk(top, maxdepth):
    dirs, nondirs = [], []
    for name in os.listdir(top):
        (dirs if os.path.isdir(os.path.join(top, name)) else nondirs).append(name)
    yield top, dirs, nondirs
    if maxdepth > 1:
        for name in dirs:
            for x in walk(os.path.join(top, name), maxdepth-1):
                yield x

for x in walk(".", 2):
    print(x)
Kevin
  • That's a pretty long piece of code for a small problem... I'd prefer a more compact solution if possible. And I think you mean `for ... in walk(...):` in the second last line instead of `os.walk`, don't you? – Byte Commander Feb 10 '16 at 13:23
  • Funny, I was just composing a shorter version :-) and you're right about the errant `os.` on the penultimate line; fixed. – Kevin Feb 10 '16 at 13:30
  • That short version looks cool. I modified it to not return directories (as I only need files), and to compare `if maxdepth != 0` so that 0 means only the current directory and I can use negative values to travel the entire directory structure. – Byte Commander Feb 10 '16 at 13:43
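
The modification described in the last comment might look roughly like the sketch below. It is a guess at that variant rather than the commenter's actual code: the name walk_files is hypothetical, it yields full file paths only, maxdepth=0 means just the starting directory, and any negative value never reaches 0 and so walks the whole tree.

```python
import os


def walk_files(top, maxdepth):
    """Yield file paths under `top` (hypothetical files-only variant).

    maxdepth == 0 -> only files directly in `top`
    maxdepth  < 0 -> no depth limit
    """
    subdirs = []
    for name in os.listdir(top):
        full = os.path.join(top, name)
        if os.path.isdir(full):
            subdirs.append(full)
        else:
            yield full
    if maxdepth != 0:
        for subdir in subdirs:
            # Assumption: symlinked directories are followed, as in the
            # short version above, so a symlink cycle would recurse until
            # Python's recursion limit when maxdepth is negative.
            yield from walk_files(subdir, maxdepth - 1)
```

With this convention, walk_files(".", 2) covers the current directory plus two subdirectory levels, which matches the example in the question.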

Starting with Python 3.5, os.walk uses os.scandir instead of os.listdir, which is considerably faster. I adapted @Kevin's example accordingly.

import os

def walk(top, maxdepth):
    dirs, nondirs = [], []
    for entry in os.scandir(top):
        (dirs if entry.is_dir() else nondirs).append(entry.path)
    yield top, dirs, nondirs
    if maxdepth > 1:
        for path in dirs:
            for x in walk(path, maxdepth-1):
                yield x

for x in walk(".", 2):
    print(x)
Arty
  • 1
    it's much faster on windows. And there are backports (`scandir` module) for python < 3.5 – Jean-François Fabre Dec 13 '18 at 21:52
  • 3
    walkMaxDepth is not defined. should be walk? – pacukluka Apr 07 '21 at 18:49
  • 2
    It's funny that in two years no one paid attention to this mistake. I took the code from two different places and the copy paste resulted in different names. This is recursion and instead of walkMaxDepth there should be the name of the motherboard function walk. I have fixed this in code. Thank you for paying attention to this. I myself suffer a lot when the finished snippet does not work. – Arty Apr 08 '21 at 21:30
  • To be sufficiently like os.walk, the nondirs list should consist of basenames only. In these solutions, it contains full paths. This is a tad ugly, but it would make walk and os.walk produce similar results: `(dirs if entry.is_dir() else nondirs).append(entry.path if entry.is_dir() else os.path.basename(entry.path))` – Brian K Oct 29 '21 at 23:19
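
One possible way to follow up on that last comment is sketched below. It goes a step further than the comment's one-liner and is my assumption rather than anyone's posted code: both lists hold bare names taken from `entry.name`, as os.walk's do, and the full path is rebuilt only when recursing. The name walk_names is hypothetical, and the `with os.scandir(...)` form needs Python 3.6 or later.

```python
import os


def walk_names(top, maxdepth):
    """Depth-limited walk yielding (top, dirs, nondirs) with bare names.

    Hypothetical variant; maxdepth=1 means only `top` itself,
    as in the answers above.
    """
    dirs, nondirs = [], []
    with os.scandir(top) as entries:  # context manager closes the iterator
        for entry in entries:
            (dirs if entry.is_dir() else nondirs).append(entry.name)
    yield top, dirs, nondirs
    if maxdepth > 1:
        for name in dirs:
            yield from walk_names(os.path.join(top, name), maxdepth - 1)
```

With a maxdepth larger than the tree is deep, this should yield the same tuples as os.walk for a tree without symlinks, though possibly in a different order.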