4

Due to a large and convoluted directory structure, my script is searching too many directories:

root--
     |
     --Project A--
                  |
                  -- Irrelevant
                  -- Irrelevant
                  -- TARGET
     |
     --Project B--
                  |
                  -- Irrelevant
                  -- TARGET
                  -- Irrelevant
     |
     -- Irrelevant  --
                       |
                       --- Irrelevant

The TARGET directory is the only one I need to traverse and it has a consistent name in each project (we'll just call it Target here).

I looked at this question:

Excluding directories in os.walk

but instead of excluding, I need to include the "target" directory which isn't at the "root" level, but one level down.

I've tried something like:

def walker(path):
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames[:] = set(['TARGET'])

But this affects the root directory too, so none of the directories it needs to traverse (Project A, Project B...) are ever visited.

mk8efz

2 Answers

4

The issue with your code is that you modify the dirnames list unconditionally. Even at the root level every subdirectory is removed, so os.walk never descends into the various Project X directories.

What you want is to purge other directories only when the TARGET one is present:

if 'TARGET' in dirnames:
    dirnames[:] = ['TARGET']

This will allow the os.walk call to visit the Project X directories, but will prevent it from going inside the Irrelevant ones.
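As a minimal sketch of how the conditional pruning might fit into a complete walker (the helper name `walk_targets` and its `target` parameter are illustrative, not from the question), the generator below prunes siblings only where a TARGET is present and yields only directories at or below a TARGET. Note that directories without a TARGET child, such as a root-level Irrelevant, are still walked; they just produce no results.

```python
import os


def walk_targets(root, target="TARGET"):
    """Yield (dirpath, filenames) for each directory at or below a target dir."""
    for dirpath, dirnames, filenames in os.walk(root):
        if target in dirnames:
            # Prune in place so os.walk only descends into the target.
            dirnames[:] = [target]
        # Report a directory only if the target is one of its path components.
        if target in os.path.relpath(dirpath, root).split(os.sep):
            yield dirpath, filenames
```

A caller would iterate it like os.walk, for example `for dirpath, filenames in walk_targets('/root'): ...`.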

Bakuriu
  • Seems like this would do nothing if the initial path is `/root` because it contains no subdirectory named `TARGET`. – martineau Nov 02 '16 at 16:58
  • @martineau Yes, that is correct. By *not* removing the directory names it *does* recurse into the other directories, which is the point of the answer. – Bakuriu Nov 02 '16 at 17:01
2

For a whitelisting scenario like this, I'd suggest using glob.iglob to get the directories by a pattern. It's a generator, so you'll get each result as fast as it finds them (Note: At time of writing, it's still implemented with os.listdir under the hood, not os.scandir, so it's only half a generator; each directory is scanned eagerly, but it only scans the next directory once it's finished yielding values from the current directory). For example, in this case:

try:
    from future_builtins import filter  # Py2 only: generator-based filter
except ImportError:
    pass  # Py3: the built-in filter is already lazy

import os.path
import glob

from operator import methodcaller

try:
    from os import scandir       # Built-in on 3.5 and above
except ImportError:
    from scandir import scandir  # PyPI package on 3.4 and below

# If on 3.4+, use glob.escape for safety; before then, if path might contain glob
# special characters and you don't want them processed you need to escape manually
globpat = os.path.join(glob.escape(path), '*', 'TARGET')

# Find paths matching the pattern, filtering out non-directories as we go:
for targetdir in filter(os.path.isdir, glob.iglob(globpat)):
    # targetdir is the qualified name of a single directory matching the pattern,
    # so if you want to process the files in that directory, you can follow up with:
    for fileentry in filter(methodcaller('is_file'), scandir(targetdir)):
        # fileentry is a DirEntry with attributes for .name, .path, etc.
        print(fileentry.path)  # placeholder: replace with your own handling

See the docs on os.scandir for more advanced usage, or you can just make the inner loop a call to os.walk to preserve most of your original code as is.
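The second option, keeping os.walk as the inner loop, could look roughly like the sketch below. It assumes Python 3.4+ (for glob.escape), and the function name `files_under_targets` is made up for illustration.

```python
import glob
import os


def files_under_targets(root, target="TARGET"):
    """Yield the path of every file below root/*/target, however deeply nested."""
    pattern = os.path.join(glob.escape(root), "*", target)
    for targetdir in glob.iglob(pattern):
        if not os.path.isdir(targetdir):
            continue  # skip plain files that happen to share the target's name
        # Only the matched directory is walked; its siblings are never opened.
        for dirpath, _dirnames, filenames in os.walk(targetdir):
            for name in filenames:
                yield os.path.join(dirpath, name)
```

Because the glob pattern has exactly one `*` component, only the immediate children of the root are listed to find the TARGET directories.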

If you really must use os.walk, you can just be more targeted in how you prune dirs. Since you specified all TARGET directories should be only one level down, this is actually pretty easy. os.walk walks top down by default, which means the first set of results will be the root directory (which you don't want to prune solely to TARGET entries). So you can do:

import fnmatch
import os

for i, (dirpath, dirs, files) in enumerate(os.walk(path)):
    if i == 0:
        # Top level dir, prune non-Project dirs
        dirs[:] = fnmatch.filter(dirs, 'Project *')
    elif os.path.samefile(os.path.dirname(dirpath), path):
        # Second level dir, prune non-TARGET dirs
        dirs[:] = fnmatch.filter(dirs, 'TARGET')
    else:
        # Do whatever handling you'd normally do for files and directories
        # located under path/Project */TARGET/
        pass
ShadowRanger
  • Am I mistaken, or does this fail if the directories might be nested more than one level deep? And if it doesn't fail, then it will still list the contents of the `Irrelevant` directories and check every path inside them against the pattern, which wastes time. – Bakuriu Nov 02 '16 at 16:53
  • @Bakuriu: If they might be nested more than one level, make `globpat = os.path.join(glob.escape(path), '**', 'TARGET')` and make the `iglob` call `glob.iglob(globpat, recursive=True)` and it will descend indefinitely. The OP seemed to want it exactly one level down though, so deep recursion shouldn't be needed. You can also tweak the wildcards to avoid `Irrelevant` directories for that single level case, e.g. replacing the `'*'` component with `'Project [AB]'` for very targeted selection, or `'Project *'` for any directory with a `Project ` prefix. – ShadowRanger Nov 02 '16 at 17:01
  • Yes, but my point is that if you use `**` your answer becomes inefficient. `iglob` cannot know that *nothing* below `Irrelevant` matters, yet it will still have to check all those contents (which, if `Irrelevant` contains millions of files or nested subdirectories will take significant time). – Bakuriu Nov 02 '16 at 17:03
  • @Bakuriu: Sure. But then, the OP doesn't need `**`. Sure, for very fine grained filters with deep recursive searches in large trees, globbing is simple but slow. But for the OP's case, it should be fine. I did add a more involved fully filtering `os.walk` solution for completeness that filters the top level and second level specially so it only fully processes trees under `Project */TARGET`. But if `TARGET` can be at any depth, it's not saving you work, you're still looking through everything under `Project */` looking for `TARGET` directories. – ShadowRanger Nov 02 '16 at 17:20
  • @Bakuriu: While it's not relevant to the single level check here, I did notice one reason you'd want to avoid `glob`: It follows symlinks, and can't be told not to do so. You might want this, but it's comparatively rare, and risks problems (cyclic references, or just references out to a huge directory you don't want to search). `os.walk` can follow symlinks, but doesn't by default. – ShadowRanger Nov 02 '16 at 17:42
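
For reference, the any-depth variant discussed in these comments might be sketched as follows, assuming Python 3.5+ (where glob accepts `recursive=True`). The function name `find_targets_any_depth` is hypothetical. As noted above, this form has to scan every directory under the root, including the Irrelevant ones, so it trades speed for simplicity.

```python
import glob
import os


def find_targets_any_depth(root, target="TARGET"):
    """Return every directory named `target` at any depth below root."""
    # '**' with recursive=True matches zero or more intermediate directories.
    pattern = os.path.join(glob.escape(root), "**", target)
    matches = glob.iglob(pattern, recursive=True)
    return sorted(p for p in matches if os.path.isdir(p))
```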