os.walk() doesn't use os.listdir(). It uses the much faster os.scandir() function, which provides an iterator with more information per directory entry:
Using scandir() instead of listdir() can significantly increase the performance of code that also needs file type or file attribute information, because os.DirEntry objects expose this information if the operating system provides it when scanning a directory. All os.DirEntry methods may perform a system call, but is_dir() and is_file() usually only require a system call for symbolic links; os.DirEntry.stat() always requires a system call on Unix but only requires one for symbolic links on Windows.
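To make the quoted behaviour concrete, here is a minimal sketch of reading entry information straight from os.DirEntry objects. The function name describe_entries is made up for illustration, and reporting the size is just one example of attribute information you might want.

```python
import os

def describe_entries(path):
    """Return (name, is_dir, size) tuples for the entries in path, sorted by name.

    describe_entries is a hypothetical helper, not part of the os module.
    """
    rows = []
    with os.scandir(path) as entries:
        for entry in entries:
            # is_dir() can usually be answered from data gathered during the scan.
            is_dir = entry.is_dir()
            # stat() may cost an extra system call, so only ask for it on files.
            size = None if is_dir else entry.stat().st_size
            rows.append((entry.name, is_dir, size))
    return sorted(rows, key=lambda row: row[0])
```

With os.listdir() you would only get the names back, and would need a separate os.path.isdir() or os.stat() call per name to learn the rest.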
The os.walk() code makes heavy use of the DirEntry.is_dir() call, which with os.scandir() is much cheaper than using os.path.isdir() (which must make separate os.stat() calls).
Next, your code is calling os.path.isdir() too often. You are effectively calling it twice for every file entry in your path. You already collected all the subdirectories in y, so you don't need to test the paths again when re-creating var. These extra isdir() calls cost you a lot of time.
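If you want to stay with os.listdir() for now, one way to avoid the duplicate tests is to classify each name exactly once and reuse the result. This is only a sketch; split_entries and its variable names are my own, not taken from your code.

```python
import os

def split_entries(path):
    """Partition the names in path into (dirs, files) with one isdir() call each.

    split_entries is a hypothetical helper used for illustration.
    """
    dirs, files = [], []
    for name in os.listdir(path):
        full_path = os.path.join(path, name)
        # Test each entry once and remember the answer.
        if os.path.isdir(full_path):
            dirs.append(name)
        else:
            files.append(name)
    return dirs, files
```

The paths to recurse into can then be built from dirs directly, for example with [os.path.join(path, d) for d in dirs], without calling isdir() again.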
You also recurse when var is empty (no further subdirectories), causing you to first wrap the empty list in another list, after which os.listdir() raises a TypeError exception, which your blanket Pokemon-catch-em-all except handler silences.
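A narrower handler would have surfaced that bug right away. The sketch below assumes you only want to tolerate filesystem problems such as a missing or unreadable directory; list_names is a made-up name, and returning an empty list is just one possible way to handle the failure.

```python
import os

def list_names(path):
    """Return the names in path, or [] if the directory can't be read.

    list_names is a hypothetical helper used for illustration.
    """
    try:
        return os.listdir(path)
    except OSError:
        # Only filesystem errors (missing directory, permissions) are handled.
        # Programming errors such as a TypeError still propagate to the caller.
        return []
```

Passing a list instead of a path, as your recursion ends up doing, now fails loudly with a TypeError instead of being swallowed.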
Next, you should get rid of the global variables, and use proper variable names. files and dirs would be far clearer names than y and z. Because you made y and z globals, you are retaining all file and directory names for a given level, and for every first subdirectory on down you then re-report those same file and directory names as if they were members of those subdirectories. Only when the first leaf of such a directory tree (with no further subdirectories) is reached do the .clear() calls on y and z get executed, leading to very confusing results with repeated filenames.
You can study the os.walk() source code, but if we simplify it down to only use top-down traversal and no error handling, then it comes down to:

    import os

    def walk(top):
        dirs = []
        nondirs = []
        # A single os.scandir() pass classifies every entry in this directory.
        with os.scandir(top) as scandir_it:
            for entry in scandir_it:
                if entry.is_dir():
                    dirs.append(entry.name)
                else:
                    nondirs.append(entry.name)
        yield top, dirs, nondirs
        # Recurse into each subdirectory found above.
        for dirname in dirs:
            new_path = os.path.join(top, dirname)
            yield from walk(new_path)

Note that there are no global variables used; there simply is no need for any in this algorithm. There is only a single os.scandir() call per directory, and the dirs variable is re-used to recurse into subdirectories.
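The same pattern adapts easily if all you want is the file paths. The following is a sketch under a couple of assumptions: iter_file_paths is a hypothetical name, it uses DirEntry.path instead of os.path.join(), and it passes follow_symlinks=False so that symbolic links to directories are reported as files rather than followed (which avoids looping on circular links).

```python
import os

def iter_file_paths(top):
    """Yield the path of every non-directory entry below top, top-down.

    iter_file_paths is a hypothetical helper, not part of the os module.
    """
    subdirs = []
    with os.scandir(top) as entries:
        for entry in entries:
            # Local list per call, so nothing leaks between directory levels.
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            else:
                yield entry.path
    # Recurse only after this directory has been fully scanned.
    for subdir in subdirs:
        yield from iter_file_paths(subdir)
```

Because every call keeps its own subdirs list, each level only ever reports its own entries, which is exactly what the global y and z lists prevented.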