200

I'm writing a script that descends into a directory tree (using os.walk()) and then visits each file matching a certain file extension. However, since some of the directory trees that my tool will be used on also contain sub directories that in turn contain a LOT of useless (for the purpose of this script) stuff, I figured I'd add an option for the user to specify a list of directories to exclude from the traversal.

This is easy enough with os.walk(). After all, it's up to me to decide whether I actually want to visit the respective files / dirs yielded by os.walk() or just skip them. The problem is that if I have, for example, a directory tree like this:

root--
     |
     --- dirA
     |
     --- dirB
     |
     --- uselessStuff --
                       |
                       --- moreJunk
                       |
                       --- yetMoreJunk

and I want to exclude uselessStuff and all its children, os.walk() will still descend into all the (potentially thousands of) sub directories of uselessStuff, which, needless to say, slows things down a lot. In an ideal world, I could tell os.walk() to not even bother yielding any more children of uselessStuff, but to my knowledge there is no way of doing that (is there?).

Does anyone have an idea? Maybe there's a third-party library that provides something like that?

antred
  • 3,697
  • 2
  • 29
  • 37

2 Answers2

335

Modifying dirs in-place will prune the (subsequent) files and directories visited by os.walk:

# exclude = set(['New folder', 'Windows', 'Desktop'])
for root, dirs, files in os.walk(top, topdown=True):
    dirs[:] = [d for d in dirs if d not in exclude]

From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search...

FLASH3G
  • 23
  • 4
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 1
    Thanks, man, you're awesome! (And I guess I need to get in the habit of reading the module docs more in depth rather than just skim over them in a hurry!) – antred Nov 08 '13 at 13:19
  • 72
    @ben: `dirs[:] = value` modifies `dirs` *in-place*. It changes the contents of the list `dirs` without changing the container. As `help(os.walk)` mentions, this is needed if you wish to affect the way `os.walk` traverses the subdirectories. (`dirs = value` merely reassigns (or "binds") the variable `dirs` to a new list, without modifying the original `dirs`.) – unutbu Dec 10 '14 at 22:25
  • 8
    You can also use `filter()`: `dirs[:] = list(filter(lambda x: not x in exclude, dirs))` – NuclearPeon Apr 28 '15 at 02:26
  • Is there any way to do this globally to `os.walk`? Lets say to exclude all */.git directories in all `os.walk` calls? – bmikolaj Dec 02 '15 at 06:31
  • 3
    @p014k: You could write your own generator function which calls `os.walk` and yields `root, dirs, files` after excluding `.git` (or whatever else you wish) from `dirs`. – unutbu Dec 02 '15 at 11:12
  • When I try the follow, I get a `TypeError: 'str' object does not support item assignment`: `def mywalk(walk_path): exclude = set(['.git']) for dirs, sub, files in os.walk(walk_path, topdown=True): dirs[:] = [d for d in dirs if d not in exclude] yield (dirs, sub, files)` What did I do wrong? – bmikolaj Dec 02 '15 at 22:32
  • 2
    In `for dirs, sub, files in os.walk(walk_path)`, `dirs` is what I called `root` above, and `sub` is what I called `dirs`. Thus your `dirs` is a string, and `sub` is a list of directories. `dirs[:] = ...` raises an error because the string `dirs` does not support item assignment. You basically have the right idea, but I suggest you stick with `for root, dirs, files in os.walk(...)`. See [the docs](https://docs.python.org/2/library/os.html#os.walk) for the meaning of the 3-tuple `(root, dirs, files)` yielded by `os.walk`. – unutbu Dec 02 '15 at 23:07
  • 1
    Wish I could give more than one +1: one for the answer and two for learning about the `list[:]` re-assignment. – OozeMeister Feb 04 '16 at 21:55
  • 5
    @unutbu Just letting you know that in one case, this optimization has reduced traversal time from more than 100 seconds to about 2 seconds. That's what I call a worthwhile optimization. :D – antred Apr 14 '16 at 13:09
  • 2
    Is this by design? This feels sooo hacky. – Moberg Sep 25 '18 at 16:03
  • why set() and not the tuple in exclude? Because it is unchangeable? – Timo Dec 06 '20 at 14:07
13

... an alternative form of @unutbu's excellent answer that reads a little more directly, given that the intent is to exclude directories, at the cost of O(n**2) vs O(n) time.

(Making a copy of the dirs list with list(dirs) is required for correct execution)

# exclude = set([...])
for root, dirs, files in os.walk(top, topdown=True):
    [dirs.remove(d) for d in list(dirs) if d in exclude]
Dmitri
  • 757
  • 1
  • 7
  • 15
  • 8
    If you want to be more direct at the cost of some memory, you'd better write `dirs[:] = set(dirs) - exclude`. At least it's still \$O(n)\$ and you don't build a comprehension only for its side effects... – 301_Moved_Permanently Nov 05 '16 at 09:33
  • 4
    This is not really bad but not idiomatic Python either in my opinion. – Torsten Bronger Dec 16 '16 at 06:07
  • `for d in list(dirs)` is a bit odd. `dirs` is already a list. And what you have isn't really a list comprehension. `dirs.remove(d)` doesn't return anything, so you end up with a list full of `None`s. I agree with @Torsten. – seanahern Jul 22 '20 at 15:13
  • @ seanahern - I think `list(dirs)` is needed to make a copy of `dirs` so that you are iterating over a copy of the list. If you modify the same copy that is being iterated over, you will get bugs. You end up with a list of Nones but this is fine since `dirs` is modified inplace. The list comprehension is basically just a way to fit the for loop in one line. – digbyterrell Feb 11 '21 at 21:42
  • Which solution is O(n**2) and which solution is O(n)? – Étienne Feb 01 '23 at 10:11