
Unfortunately this code runs slower than os.walk, but why?

Could it be the for loop that makes it slow?

Here is my code that works like os.walk (i.e. it does what the os.walk function does).

Note: I wrote this to improve myself!

import os, time
from os.path import *

x = ""
y = []
z = []
var = 0

def walk(xew):
    global top, var, x,y,z
    if not var: var = [xew]
    for i in var:
        try:
            for ii in os.listdir(i):
                y.append(ii) if isdir(i+os.sep+ii) else z.append(ii)

            x = top = i
            var = [top+os.sep+i for i in os.listdir(top) if isdir(top+os.sep+i)]         
        except:
            continue
        yield x,y,z
        yield from walk(var)
        var.clear();y.clear();z.clear()

For example:

This finishes in 2 seconds:

for x,y,z in walk(path):
    print(x)

This finishes in 0.5 seconds:

for x,y,z in os.walk(path):
    print(x)
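(For reference, timings like those above can be reproduced with time.perf_counter; the path value below is a placeholder for whatever large directory tree is being walked:)

```python
import os
import time

path = "."  # placeholder: point this at a large directory tree

start = time.perf_counter()
count = sum(1 for _ in os.walk(path))  # visit every directory, counting them
elapsed = time.perf_counter() - start
print(f"os.walk visited {count} directories in {elapsed:.2f} seconds")
```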
Jundullah
    Apart from implementation details, definitely check this https://stackoverflow.com/questions/12590058/python-performance-with-global-variables-vs-local if it will affect the performance – mfrackowiak Jan 28 '19 at 12:46
  • @mfrackowiak: that doesn't account for the big difference, however. – Martijn Pieters Jan 28 '19 at 12:48
  • I would guess anything that takes `os.walk` half a second to walk is a somewhat large directory tree. Perhaps with some symlinks. The OP's implementation follows symlinked directories, whilst the version in `os` does not by default. – Dunes Jan 28 '19 at 13:30

2 Answers


os.walk() doesn't use os.listdir(). It uses the much faster os.scandir() function, which provides an iterator with more information per directory entry:

Using scandir() instead of listdir() can significantly increase the performance of code that also needs file type or file attribute information, because os.DirEntry objects expose this information if the operating system provides it when scanning a directory. All os.DirEntry methods may perform a system call, but is_dir() and is_file() usually only require a system call for symbolic links; os.DirEntry.stat() always requires a system call on Unix but only requires one for symbolic links on Windows.

The os.walk() code makes heavy use of the DirEntry.is_dir() call, which with os.scandir() is much cheaper than using os.path.isdir() (which must make a separate os.stat() call per path).
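To make the difference concrete, here is a small sketch (the helper names classify and classify_slow are made up for illustration): the scandir() version classifies a directory's entries using type information the operating system already returned during the scan, while the listdir() version must stat() every entry separately.

```python
import os

def classify(top):
    """One scandir() pass; entry.is_dir() usually needs no extra stat() call
    (the OS returns file-type info while scanning, except for symlinks)."""
    dirs, files = [], []
    with os.scandir(top) as it:
        for entry in it:
            (dirs if entry.is_dir() else files).append(entry.name)
    return dirs, files

def classify_slow(top):
    """listdir() equivalent: os.path.isdir() triggers a separate os.stat()
    system call for every single entry."""
    dirs, files = [], []
    for name in os.listdir(top):
        (dirs if os.path.isdir(os.path.join(top, name)) else files).append(name)
    return dirs, files
```

Both return the same classification; the scandir() version simply pays far fewer system calls per directory.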

Next, your code calls os.path.isdir() too often: effectively twice for every file entry in your path. You have already collected all the subdirectory names in y, so you don't need to test the paths again when re-creating var. These extra isdir() calls cost you a lot of time.
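A sketch of that fix, keeping the question's listdir() approach (list_level is a hypothetical helper name): classify each entry with one isdir() call, then build the next level of paths from the names already known to be directories instead of re-testing everything.

```python
import os
from os.path import isdir, join

def list_level(top):
    dirs, files = [], []
    for name in os.listdir(top):
        # one isdir() per entry...
        (dirs if isdir(join(top, name)) else files).append(name)
    # ...and none here: the names in `dirs` are already known to be
    # directories, so just join them with the parent path.
    subdir_paths = [join(top, name) for name in dirs]
    return dirs, files, subdir_paths
```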

You also recurse when var is empty (no further subdirectories), causing you to first wrap the empty list in another list, after which os.listdir() throws a TypeError exception which your blanket Pokemon-catch-em-all except handler silences.

Next, you should get rid of the global variables, and use proper variable names; dirs and files would be far clearer names than y and z. Because you made y and z globals you are retaining all file and directory names for a given level, and for every first subdirectory on down, you then re-report those same file and directory names as if they are members of those subdirectories. Only when the first leaf of such a directory tree (with no further subdirectories) is reached do the .clear() calls on y and z get executed, leading to very confusing results with repeated filenames.

You can study the os.walk() source code, but if we simplify it down to only use top-down traversal and no error handling, then it comes down to:

def walk(top):
    dirs = []
    nondirs = []

    with os.scandir(top) as scandir_it:
        for entry in scandir_it:
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                nondirs.append(entry.name)

    yield top, dirs, nondirs

    for dirname in dirs:
        new_path = os.path.join(top, dirname) 
        yield from walk(new_path)

Note that there are no global variables used; there simply is no need for any in this algorithm. There is only a single os.scandir() call per directory, and the dirs variable is re-used to recurse into subdirectories.
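As a quick sanity check (the simplified walk() is repeated here so the snippet runs on its own), its top-down output should match os.walk() on a tree that contains no symlinks or unreadable directories:

```python
import os

def walk(top):
    dirs, nondirs = [], []
    with os.scandir(top) as scandir_it:
        for entry in scandir_it:
            (dirs if entry.is_dir() else nondirs).append(entry.name)
    yield top, dirs, nondirs
    for dirname in dirs:
        yield from walk(os.path.join(top, dirname))

def snapshot(walker, top):
    # Sort everything so OS-dependent listing order doesn't matter.
    return sorted((t, sorted(d), sorted(f)) for t, d, f in walker(top))
```

On such a tree, snapshot(walk, path) and snapshot(os.walk, path) produce identical results.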

Martijn Pieters
  • If I'm not mistaken the change to using `os.scandir` (which was a relatively recent work by Giampaolo Rodolà) accounts for something like 20-50% performance improvement and a similar amount of decrease in number of syscalls, however the OP experiences a 200% performance difference. So this is not the only reason, even though it is a significant factor. – Giacomo Alzetta Jan 28 '19 at 12:55
  • @GiacomoAlzetta: it depends on the number of files and the number of subdirectories to recurse into as well. We don't know the mix here. – Martijn Pieters Jan 28 '19 at 12:56
  • Yes, you are right. The OP should test `os.walk` on the same directory for two different Python versions (e.g. Python 2.7 vs Python 3.7) and see whether there is a significant difference... if there is not a huge difference then another factor is at play; if the difference is big then it's probably just the change to `os.scandir`. – Giacomo Alzetta Jan 28 '19 at 12:59
  • @GiacomoAlzetta: here it is the doubled `os.isdir()` calls that cover the remaining time differences. – Martijn Pieters Jan 28 '19 at 13:11
  • @MartijnPieters Thanks for your answer, my purpose was to understand why this code was slow, and thanks to you, I learned something to satisfy myself! – Jundullah Jan 28 '19 at 13:46

This code works almost as fast as os.walk!

import os
from os.path import isdir

def walk(top):
    x, y, z = top, [], []
    try:
        for i in os.listdir(top):
            # y collects directory names, z collects file names
            (y if isdir(top + os.sep + i) else z).append(i)
    except OSError:  # skip directories we cannot read
        return
    yield x, y, z
    for q in y:
        yield from walk(top + os.sep + q)
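A rough way to compare the two (the generator is repeated here, with the bare except narrowed to OSError, so the snippet is self-contained; path is a placeholder for a large tree). Note the counts can differ on trees containing directory symlinks, since this version follows them while os.walk does not by default:

```python
import os
import time
from os.path import isdir

def walk(top):
    x, y, z = top, [], []
    try:
        for i in os.listdir(top):
            (y if isdir(top + os.sep + i) else z).append(i)
    except OSError:  # skip directories we cannot read
        return
    yield x, y, z
    for q in y:
        yield from walk(top + os.sep + q)

path = "."  # placeholder: use a large tree for a meaningful comparison

for label, walker in (("listdir version", walk), ("os.walk", os.walk)):
    start = time.perf_counter()
    visited = sum(1 for _ in walker(path))
    print(f"{label}: {visited} directories in {time.perf_counter() - start:.3f}s")
```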
Jundullah