
I am trying to find all .xlsm files (and get their stats) on the network drive O:\, provided they are not in a folder called Test. I was using os.walk and switched to scandir.walk because it's faster. I am now limited mainly by network speed; this code seems to make a lot of round trips between the script and the network drive. My code is below. Is there a way to speed this up, maybe using a batch file? I'm on Windows.

from scandir import scandir, walk

def subdirs(path):
    for path, folders, files in walk(path):
        if 'Test' not in path:
            for sub_file in scandir(path):
                if '.xlsm' in sub_file.path:
                    yield sub_file.stat()

for i in subdirs('O:\\'):
    print i
user2242044

1 Answer


You are double-scanning every path: once implicitly via walk, then again by explicitly re-scanning with scandir the very path walk just returned, for no reason. walk already returned the files, so the inner loop can avoid the double scan by using what it was given:

import os

def subdirs(path):
    for path, folders, files in walk(path):
        for file in files:
            if '.xlsm' in file:
                yield os.path.join(path, file)
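
Note that this version yields plain path strings. If you also need the stat info (as the updated question does), each yielded path would cost one extra round trip to the share to stat, along the lines of this sketch:

import os

for path in subdirs('O:\\'):
    st = os.stat(path)  # one extra network round trip per file
    print path, st.st_size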

To address the updated question, you'll probably want to either copy the existing scandir.walk code and modify it to return lists of DirEntry objects instead of lists of names, or write similar special-cased code for your specific needs; either way, this avoids the double scan while keeping scandir's low-overhead behavior. For example:

import scandir

def scanwalk(path, followlinks=False):
    '''Simplified scandir.walk; yields lists of DirEntries instead of lists of str'''
    dirs, nondirs = [], []
    for entry in scandir.scandir(path):
        if entry.is_dir(follow_symlinks=followlinks):
            dirs.append(entry)
        else:
            nondirs.append(entry)
    yield path, dirs, nondirs
    # Recurse into each subdirectory; the caller may prune dirs in place
    # after receiving them, just like with os.walk
    for dir in dirs:
        for res in scanwalk(dir.path, followlinks=followlinks):
            yield res

You can then replace your use of walk with it like this (I also added code that prunes directories with Test in their names; your original code rejected everything under them anyway, but still traversed them unnecessarily):

def subdirs(path):
    # Full prune if the path already contains Test
    if 'Test' in path:
        return
    for path, folders, files in scanwalk(path):
        # Remove any directory with Test to prevent traversal
        folders[:] = [d for d in folders if 'Test' not in d.name]
        for file in files:
            if '.xlsm' in file.path:
                yield file.stat()  # Maybe just yield file to get raw DirEntry?

for i in subdirs('O:\\'):
    print i
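
If you instead take the suggestion in the inline comment and yield the DirEntry itself, the caller keeps both the path and the stat info; on Windows, scandir caches the stat data gathered while listing the directory, so no extra round trip is needed. A sketch of consuming that hypothetical variant:

# Assumes subdirs was changed to `yield file` instead of `yield file.stat()`
for entry in subdirs('O:\\'):
    st = entry.stat()  # served from scandir's Windows cache; no extra network call
    print entry.path, st.st_size, st.st_mtime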

BTW, you may want to double-check that you've properly installed/built the C accelerator for `scandir`, `_scandir`. If `_scandir` isn't built, the `scandir` module provides fallback implementations using `ctypes`, but they're significantly slower, which could explain performance problems. Try running `import _scandir` in an interactive Python session; if it raises `ImportError`, you don't have the accelerator and you're using the slow fallback implementation.
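
For example, a quick self-check you could drop at the top of the script (just a sketch):

try:
    import _scandir  # the C accelerator extension
except ImportError:
    print 'WARNING: _scandir not built; using the slower ctypes fallback'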

ShadowRanger
  • The only problem with this is that I lose the other variables associated with files that I would normally need to get via `os.stat()`; see: https://github.com/benhoyt/scandir – user2242044 Mar 01 '16 at 05:02
  • @user2242044: I know about `scandir` (I evangelize it regularly), but your original code was only `yield`ing the path, not the `DirEntry` object; you weren't making use of the free/cached `stat` behavior. If you need the `stat` info, you may want to copy the `scandir.walk` implementation and modify it to return `list`s of the raw `DirEntry` objects for `dirnames` and `filenames` instead of returning a `list` of `str`. That gets the best of both worlds; single pass scanning, no repetitive `stat`ing. Or just implement your own recursive function that specializes to your use case. – ShadowRanger Mar 01 '16 at 13:57
  • @ShadowRanger: Thanks for the clarification. I tried to simplify the code for the SO question, but I think I lost some of the intent. The double iteration was because I was trying to avoid iterating over some folders, and I do want the stat yield. Modified the original question. Thanks! – user2242044 Mar 01 '16 at 15:27
  • @user2242044: Gotcha. Updated answer with example of how you'd implement a `walk`-like function that `yield`ed `list`s of `DirEntry` instead of `list`s of `str`. Removing the double-scan should make it as fast as possible in Python; `scandir` is using `FindFirstFile`/`FindNextFile` already, which is the most efficient Windows file scanning API (doubly so when you need `stat` info, or when the directory being scanned is high latency; it reduces the number of round trips to something closer to one per directory, instead of one per directory + one per file). No batch file techniques will beat it. – ShadowRanger Mar 01 '16 at 19:18
  • @user2242044: I also added a note about the C accelerator at the bottom; you'll want to verify you actually built the accelerator. Without it, `scandir` would still save you network traffic, but it would add a ton of overhead to the normal code overhead. – ShadowRanger Mar 01 '16 at 19:23
  • Thanks for that added note; in PyCharm it says no module named `_scandir`, yet there's no import error when this is run or in the Python IDLE, so I assume I have it? Also, you mentioned it was faster than a batch file. My thought there was something like `dir /s/b .\*.xlsm` – user2242044 Mar 01 '16 at 19:58
  • @user2242044: `dir` is implemented in terms of `FindFirstFile`/`FindNextFile`, just like `scandir` (because of this, it doesn't take noticeably more time to do `dir` vs. `dir /b` even on a high-latency network share, even though the former echoes `stat` info which would otherwise require a round trip). While it's implemented entirely in C/C++ (no Python byte code execution overhead), to actually use the output you'd need to read its `stdout` and parse it in Python (or whatever language you like), which would undo any savings you might achieve; see the sketch after this thread. – ShadowRanger Mar 01 '16 at 20:04
  • I'm also getting a couple of errors: `AttributeError: 'unicode' object has no attribute 'name'`. Should that just be `d`? The same error applies to `file.path`. – user2242044 Mar 01 '16 at 20:06
  • @user2242044: `d` should be a `DirEntry`, did you use the code as I wrote it (using the custom `scanwalk` over `scandir.walk`)? The whole point of `scanwalk` is to get `DirEntry`s. – ShadowRanger Mar 01 '16 at 20:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/105049/discussion-between-user2242044-and-shadowranger). – user2242044 Mar 01 '16 at 20:12
  • originally posted in chat, but this seems useful to others reading this, so posting here. Two questions: 1. This code seems to look in the specified folder and its sub-folders only, not all sub-folder levels. Is that correct? 2. In the scanwalk() function, you have two yield statements (not separated by an if-else). I'm not seeing a situation in which `yield res` would be reached. – user2242044 Mar 02 '16 at 15:37
  • @user2242044: Responded at length in chat, but in summary: It recurses through the entire tree, not just the provided folder and a single level of sub-folders. `subdirs` prunes directories with `Test` in their name, preventing recursion into them, but if you have a directory `C:\a\b\c\d` (no other directories involved), then you should get `path` and `DirEntry` info for `C:\a`, then `C:\a\b`, then `C:\a\b\c`, then `C:\a\b\c\d`, and so on, however deep the tree goes (though you may have handle exhaustion issues if the depth of the tree is thousands of directories deep). – ShadowRanger Mar 02 '16 at 16:37
  • @user2242044: As for #2, you need to understand what `yield` does to a function; just because you hit the first `yield` doesn't mean the function stops forever, it just means it pauses until the caller iterates the resulting generator again. See [What does the yield keyword do in Python?](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python). The second `yield` will execute as many times as there are `dirs` (if the caller mutates a `yield`ed `dirs`, this prunes the set of sub-directories to recurse into), as long as you keep iterating; there's a minimal illustration after this thread. – ShadowRanger Mar 02 '16 at 16:38
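
For reference, here's a rough sketch of the batch-style approach discussed in the comments: shelling out to `dir /s /b` and parsing its stdout in Python. The drive and pattern are placeholders; note that the parsing (and any per-file stat) still happens in Python, which is why it doesn't beat scandir:

import subprocess

# `dir` is a cmd.exe builtin, so it needs a shell; /s recurses, /b prints bare paths
out = subprocess.check_output('dir /s /b O:\\*.xlsm', shell=True)
for line in out.splitlines():
    if 'Test' not in line:
        print line  # stat info would still need os.stat(line): one round trip each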
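
And to illustrate the `yield` point in the last comment: hitting the first `yield` only pauses the generator; later `yield`s run as the caller keeps iterating, which is how `yield res` in scanwalk gets reached. A minimal example:

def gen():
    yield 'first'               # pauses here until the caller asks for more
    for x in ('second', 'third'):
        yield x                 # reached on later iterations, like `yield res`

for value in gen():
    print value                 # prints first, second, third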