You are double-scanning every path: once implicitly via `walk`, then again by explicitly calling `scandir` on the path `walk` returned, for no reason. `walk` already returned the files, so the inner loop can avoid the double scan by just using what it was given:

    import os
    from scandir import walk  # assumed: the walk used in the question is scandir.walk

    def subdirs(path):
        for path, folders, files in walk(path):
            for file in files:
                if '.xlsm' in file:
                    yield os.path.join(path, file)

To address the updated question, you'll probably want to either copy the existing `scandir.walk` code and modify it to return `list`s of `DirEntry`s instead of `list`s of names, or write similar special-cased code for your specific needs. Either way, this lets you avoid double-scanning while keeping `scandir`'s low-overhead behavior. For example:

    import scandir

    def scanwalk(path, followlinks=False):
        '''Simplified scandir.walk; yields lists of DirEntries instead of lists of str'''
        dirs, nondirs = [], []
        for entry in scandir.scandir(path):
            if entry.is_dir(follow_symlinks=followlinks):
                dirs.append(entry)
            else:
                nondirs.append(entry)
        yield path, dirs, nondirs
        # dirs may have been pruned in place by the caller at this point
        for dir in dirs:
            for res in scanwalk(dir.path, followlinks=followlinks):
                yield res

You can then replace your use of `walk` with it like this (I also added code that prunes directories with `Test` in their names, since all directories and files under them would have been rejected by your original code, but you'd still traverse them unnecessarily):

    def subdirs(path):
        # Full prune if the path already contains Test
        if 'Test' in path:
            return
        for path, folders, files in scanwalk(path):
            # Remove any directory with Test to prevent traversal
            folders[:] = [d for d in folders if 'Test' not in d.name]
            for file in files:
                if '.xlsm' in file.path:
                    yield file.stat()  # Maybe just yield file to get raw DirEntry?

    for i in subdirs('O:\\'):
        print(i)

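
If you want to try the pruning behavior without touching the real drive, here is a minimal self-contained sketch. It assumes Python 3.6+, where `os.scandir` is in the standard library and stands in for the `scandir` backport. The names `find_xlsm` and `build_demo_tree` are hypothetical helpers made up for this demo, and it matches on `entry.name.endswith('.xlsm')` rather than the substring test used above, so treat it as an illustration rather than a drop-in replacement:

```python
import os
import tempfile


def scanwalk(path, followlinks=False):
    """Same idea as scanwalk above, built on the stdlib os.scandir (assumption)."""
    dirs, nondirs = [], []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=followlinks):
                dirs.append(entry)
            else:
                nondirs.append(entry)
    yield path, dirs, nondirs
    # The caller may have pruned dirs in place before the generator resumes here
    for d in dirs:
        for res in scanwalk(d.path, followlinks=followlinks):
            yield res


def find_xlsm(root):
    """Yield DirEntry objects for .xlsm files, skipping directories named *Test*."""
    for _, folders, files in scanwalk(root):
        # In-place slice assignment is what makes the pruning take effect
        folders[:] = [d for d in folders if 'Test' not in d.name]
        for entry in files:
            if entry.name.endswith('.xlsm'):
                yield entry


def build_demo_tree():
    """Create a throwaway tree: two files at the root, one per subdirectory."""
    root = tempfile.mkdtemp()
    for sub in ('Reports', 'TestData'):
        os.mkdir(os.path.join(root, sub))
    for rel in ('a.xlsm', 'b.txt',
                os.path.join('Reports', 'c.xlsm'),
                os.path.join('TestData', 'd.xlsm')):
        open(os.path.join(root, rel), 'w').close()
    return root


if __name__ == '__main__':
    demo_root = build_demo_tree()
    print(sorted(e.name for e in find_xlsm(demo_root)))  # ['a.xlsm', 'c.xlsm']
```

`TestData/d.xlsm` never shows up because `TestData` is removed from `folders` before `scanwalk` recurses into it, so that directory is never scanned at all.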
BTW, you may want to double-check that you've properly installed/built the C accelerator for `scandir`, `_scandir`. If `_scandir` isn't built, the `scandir` module provides fallback implementations using `ctypes`, but they're significantly slower, which could explain the performance problems. Try running `import _scandir` in an interactive Python session; if it raises `ImportError`, then you don't have the accelerator, so you're using the slow fallback implementation.
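
The same check can be scripted if you'd rather not open an interactive session. This is only a sketch: `module_available` is a hypothetical helper name, and all it does is attempt the import described above and report whether it succeeded:

```python
import importlib


def module_available(name):
    """Return True if `name` imports cleanly, False if it raises ImportError."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    if module_available('_scandir'):
        print('_scandir found: the C accelerator is available')
    else:
        print('_scandir not importable: the scandir backport, if installed, '
              'is using its slower ctypes fallback')
```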