1

I want to iterate over my file system, starting from '/' (Unix), while skipping a set of exclusions that I want to read from a file (a list, a generator, or anything similar would work too). Here is an example:

# What I want to exclude in a file
# (Just few examples)
EXCLUSIONS = ['/sys/*', '/var/lock/*', '*.pyc/*']

My idea:

import fnmatch
import os

for exclude in EXCLUSIONS:
    for root, dirs, files in os.walk('/'):
        path = root.split(os.sep)
        for p in path:
            for f in files:
                tmp = p + f
                if fnmatch.fnmatch(tmp, exclude):
                    ...

I guess this is highly inefficient and that's why it won't work. Maybe someone could give me a hint or knows a way to do this.

m1ghtfr3e
  • 1
    Look into this similar question: https://stackoverflow.com/questions/20638040/glob-exclude-pattern – Jan Apr 11 '21 at 21:39
  • @Jan _Use comments to ask for more information or suggest improvements. Avoid answering questions in comments._ – user1717828 Apr 11 '21 at 22:41
  • 1
    What happened when you tried reading the [documentation for os.walk](https://docs.python.org/3.8/library/os.html#os.walk)? In particular, the part about how the `topdown` parameter works and what happens if you modify the yielded `dirnames` lists in-place? There's already an example right there in the documentation of using this to avoid looking at version-control directories. All you need is the logic that tells you whether `any` of the `EXCLUSIONS` applies to your directories. – Karl Knechtel Apr 11 '21 at 22:48

4 Answers

2

Supposing that our rule for exclusion is "the path matches any of the EXCLUSIONS, per the logic of fnmatch.fnmatch", we can write a function to encapsulate that:

def should_exclude(path):
    return any(fnmatch.fnmatch(path, exclude) for exclude in EXCLUSIONS)

(We could generalize that by accepting the exclusions as the first parameter instead of relying on the global, and then binding it using functools.partial.)
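As a minimal sketch of that generalization (the parameter order and the name exclude_system_paths are my own choices for illustration, not something from the question), the exclusions become the first parameter and functools.partial binds them once:

```python
import fnmatch
from functools import partial


def should_exclude(exclusions, path):
    """Return True if path matches any of the fnmatch-style patterns."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in exclusions)


# Bind the pattern list once; the resulting callable takes only a path.
# The pattern list here is an example, not the asker's real configuration.
exclude_system_paths = partial(should_exclude, ['/sys/*', '/var/lock/*', '*.pyc'])

print(exclude_system_paths('/sys/kernel'))
print(exclude_system_paths('/home/user/notes.txt'))
```

The bound callable has the same one-argument shape as the global-based version, so it can be passed anywhere a predicate on a single path is expected.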

The way to make os.walk stay out of pruned directories is to walk top-down (the default) and modify the yielded lists of sub-directories in-place. We want to apply a rule iteratively to the list while also modifying it, which is tricky; the most elegant way I can think of is to use a list comprehension to create the modified version, and then slice it back into place:

def filter_inplace(source, exclude_rule):
    source[:] = [x for x in source if not exclude_rule(x)]

(Note the generalization here; we are expected to pass the filtering predicate, should_exclude, as an argument.)
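To see why the slice assignment matters, here is a small sketch that compares it with plain rebinding. The list contents and the alias names are made up for the demonstration; the point is that os.walk keeps its own reference to the list it yielded, so only an in-place change is visible to it.

```python
def filter_inplace(source, exclude_rule):
    # Slice assignment replaces the contents of the existing list object.
    source[:] = [x for x in source if not exclude_rule(x)]


def is_hidden(name):
    return name.startswith('.')


dirs = ['src', '.git', 'docs', '.hg']
alias = dirs                      # plays the role of os.walk's own reference
filter_inplace(dirs, is_hidden)
print(alias)                      # ['src', 'docs']: the alias sees the change

rebound = ['src', '.git']
alias2 = rebound
rebound = [x for x in rebound if not is_hidden(x)]  # rebinds the name only
print(alias2)                     # ['src', '.git']: the original list is untouched
```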

Now we should be able to use os.walk as documented:

def filtered_walk(root):
    for subroot, dirs, files in os.walk(root):
        yield subroot, files # the current result
        filter_inplace(dirs, should_exclude) # set up filtered recursion

This can be varied in multiple ways depending on your exact requirements. For example, you could iterate over the files and os.path.join them to the subroot, yielding each result separately. It's worth playing around a bit and debugging, to make sure you understand exactly what subroot, dirs and files look like at each step of the iteration, and verifying that the filtering gives the results you expect.
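One such variation, as a sketch under a few assumptions of mine: os.walk puts bare names such as 'sys' into dirs, so a pattern like '/sys/*' can only match if we join the name to subroot first. I also test each directory with a trailing separator so that '/sys/*' prunes '/sys' itself. The helper names matches_any and iter_files are hypothetical.

```python
import fnmatch
import os


def matches_any(path, exclusions):
    """Return True if path matches any fnmatch-style pattern."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in exclusions)


def iter_files(root, exclusions):
    """Yield full file paths under root, skipping excluded files and directories."""
    for subroot, dirs, files in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        # The second check appends a trailing separator ('/sys' -> '/sys/'),
        # which lets a pattern like '/sys/*' match the directory itself.
        dirs[:] = [
            d for d in dirs
            if not matches_any(os.path.join(subroot, d), exclusions)
            and not matches_any(os.path.join(subroot, d, ''), exclusions)
        ]
        for name in files:
            path = os.path.join(subroot, name)
            if not matches_any(path, exclusions):
                yield path


# Example call; the patterns are illustrative only.
# for path in iter_files('/', ['/sys/*', '/var/lock/*', '*.pyc']):
#     print(path)
```

Whether a pattern should be compared against the full path or only the last component depends on how the exclusion file is written, so treat the matching rule here as one possible choice.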

Karl Knechtel
0

The answer from @Karl Knechtel led me to the following:

import fnmatch
import os

EXCLUSIONS = [...]     # Defined Files / Paths to exclude.

# A function to check each path / file
# against the EXCLUSIONS
def should_exclude(path) -> bool:
    return any(fnmatch.fnmatch(path, exclude) for exclude in EXCLUSIONS)

# A function to filter out unwanted paths.
# Returns the path if it is wanted, otherwise None.
def filter_path(source, rule):
    if not rule(source):
        return source
    return None

# Now the function to walk through the
# given path.
def filtered_walk(path):
    for root, dirs, files in os.walk(path):
        # Now create real paths from dirs
        for d in dirs:
            tmp_dir = os.path.realpath(os.path.join(root, d))
            if filter_path(tmp_dir, should_exclude):
                yield tmp_dir

        # The same procedure applies to the files.
        for f in files:
            ...
            # Same logic as above..

I still think this is not efficient at all, especially for bigger file systems. It can probably be optimized.
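Two things probably cost time in a walk like this: os.walk still descends into excluded directories unless dirs is modified in place, and every path is tested against every pattern separately. A possible optimization, sketched below with hypothetical helper names (compile_exclusions, pruned_walk), combines the patterns into one compiled regular expression via fnmatch.translate and prunes dirs. Note that this skips the os.path.normcase step that fnmatch.fnmatch applies, which should not matter on POSIX.

```python
import fnmatch
import os
import re


def compile_exclusions(patterns):
    """Combine fnmatch-style patterns into a single path predicate."""
    patterns = list(patterns)
    if not patterns:
        # An empty regex would match everything, so handle this case explicitly.
        return lambda path: False
    combined = re.compile('|'.join(fnmatch.translate(p) for p in patterns))
    return lambda path: combined.match(path) is not None


def pruned_walk(top, is_excluded):
    """Yield wanted directory and file paths without entering excluded dirs."""
    for root, dirs, files in os.walk(top):
        # In-place pruning: os.walk only recurses into what is left in dirs.
        dirs[:] = [d for d in dirs if not is_excluded(os.path.join(root, d))]
        for d in dirs:
            yield os.path.join(root, d)
        for f in files:
            path = os.path.join(root, f)
            if not is_excluded(path):
                yield path
```

Because patterns are matched against the whole path here, a directory is excluded with something like '/sys' or '*/node_modules' rather than '/sys/*'.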

m1ghtfr3e
0

You can use the glob package to list the files in a directory while excluding some of them. To include all subdirectories, put ** in the pattern and pass recursive=True; recursive=True alone has no effect without **. For example, to list all .tif files except the ones whose names start with 'nerd':

import glob
base_path = '/'
all_tifs = set(glob.glob(base_path + "**/*.tif", recursive=True))
nerd_tifs = set(glob.glob(base_path + "**/nerd*.tif", recursive=True))
folderList = list(all_tifs - nerd_tifs)
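A single-pass alternative, as a sketch: glob only recurses when the pattern contains ** and recursive=True is passed, so the pattern below includes it. Instead of building two sets, it filters each hit by file name with fnmatch. The function name iter_tifs and its default pattern are my own illustration.

```python
import fnmatch
import glob
import os


def iter_tifs(base_path, excluded_name_pattern='nerd*'):
    """Yield .tif paths under base_path whose file name does not match the pattern."""
    pattern = os.path.join(base_path, '**', '*.tif')
    # iglob is lazy, which helps when the tree is large.
    for path in glob.iglob(pattern, recursive=True):
        if not fnmatch.fnmatch(os.path.basename(path), excluded_name_pattern):
            yield path
```

Keep in mind that glob skips names starting with a dot by default, and that it cannot prune directories, so the whole tree below base_path is still visited.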
-1
import os

exclude = ['gui', 'sys']


def bypass_dirs(path, exclude_dirs):
    if os.path.split(path)[-1] in exclude_dirs:
        return
    try:
        dirs = os.listdir(path)
        for dir_path in dirs:
            bypass_dirs(os.path.join(path, dir_path), exclude_dirs)
    except NotADirectoryError:
        return
    except PermissionError:
        print('Permission Error', path)
        return
    except FileNotFoundError:
        return


bypass_dirs('/', exclude)

This code lets you traverse the whole file system. You can choose the entry point and the directories to exclude. The exclusion mechanism is very simple, but you can adapt it to whatever is convenient for you.
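As posted, the function only traverses and does not return anything. A possible adaptation, assuming you want the file paths back, is a generator built on os.scandir. The name iter_paths is hypothetical, and not following symlinked directories is my own choice to avoid loops.

```python
import os


def iter_paths(path, exclude_dirs):
    """Yield file paths under path, skipping directories named in exclude_dirs."""
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                # follow_symlinks=False: symlinked directories are reported
                # as plain entries instead of being entered.
                if entry.is_dir(follow_symlinks=False):
                    if entry.name not in exclude_dirs:
                        yield from iter_paths(entry.path, exclude_dirs)
                else:
                    yield entry.path
    except (PermissionError, FileNotFoundError):
        # Unreadable or vanished directories are silently skipped.
        return


# Example call with the same simple name-based exclusion:
# for p in iter_paths('/', {'gui', 'sys'}):
#     print(p)
```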

Andrey RF