
I am using os.walk to build a map of a data-store (this map is used later in the tool I am building).

This is the code I currently use:

import os

def find_children(tickstore):
    children = []
    dir_list = os.walk(tickstore)  # creates a generator; no disk access yet
    for dirpath, dirnames, filenames in dir_list:
        children.append(dirpath)  # keep only the directory path
    return children

I have done some analysis on it:

`dir_list = os.walk(tickstore)` runs instantly; if I do nothing with `dir_list`, this function completes instantly.

It is iterating over `dir_list` that takes a long time, even if I don't append anything; just iterating over it is what takes the time.

Tickstore is a big datastore with ~10,000 directories.

Currently it takes approximately 35 minutes to complete this function.

Is there any way to speed it up?

I've looked at alternatives to os.walk but none of them seemed to provide much of an advantage in terms of speed.
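For what it's worth, the analysis above can be reproduced with a small timing sketch (this is an illustration, not part of the original question; `time.perf_counter` needs Python 3.3+, on older versions `time.time()` would do):

```python
import os
import time

def time_walk(root):
    """Time how long it takes just to exhaust os.walk(root),
    appending nothing, mirroring the experiment described above."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        count += 1  # iterate only; the walk itself does the disk access
    return count, time.perf_counter() - start
```

Calling `time_walk(tickstore)` shows that creating the generator is free and only the iteration touches the disk.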

Joe Smart
    `return [dir for dir, _, _ in os.walk(tickstore)]` might be a bit more efficient, but hard drive access is relatively slow in general. – jonrsharpe Sep 08 '15 at 10:27
  • What do you do with the generated list of children? Maybe you don't even need the entire list since the next thing you do is to filter/scan the sequence for things? – Frerich Raabe Sep 08 '15 at 10:28
  • I'd assume it's even slower, as it is accessing a network drive? – Joe Smart Sep 08 '15 at 10:28
  • Accessing a network drive has some more overhead. – Matthias Sep 08 '15 at 10:30
  • Next step is to find valid files within that list on directories, yes. But to filter them out at this earlier stage would still require iteration and it is the iteration that's taking the time. – Joe Smart Sep 08 '15 at 10:30
  • What version of Python are you using? – Thomas Orozco Sep 08 '15 at 10:31
  • @JoeSmart if you're looking for valid *files*, why are you only looking at the *directories* from `os.walk`? Are you then going back to get the files for each directory (which you've just thrown away)? This is starting to seem a lot like an XY problem... – jonrsharpe Sep 08 '15 at 10:38
  • `dir_list` is a [**generator**](https://www.python.org/dev/peps/pep-0255/), not a `list` (c: So, it'll only access the drive when iterated over. – Peter Wood Sep 08 '15 at 10:38
  • @jonrsharpe the `os.walk` creates a list of directories in which files can be stored. The structure of this datastore also allows me to filter, so the whole directory name is needed. The next function I call on children uses the directories along with `os.listdir` to filter out invalid files within those directories. – Joe Smart Sep 08 '15 at 10:41
  • @JoeSmart: `os.walk()` gets you the file list already. Don't discard the info, use it. – jfs Sep 08 '15 at 10:44
  • @JoeSmart I strongly suggest that you **read the documentation** for the function you're using. Each iteration of `os.walk` returns three things: the directory you're currently in; a list of the sub-directories in that directory; and a list of the files in that directory. It is also a generator, so can be used more efficiently if you don't need everything at once. It seems certain that there is a smarter way to do what you're currently doing that will give both performance and readability/maintainability benefits. – jonrsharpe Sep 08 '15 at 10:47
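The point made in the last two comments could be sketched like this: collect the directories and their files in one `os.walk` pass instead of throwing the file lists away and re-reading them later with `os.listdir` (a sketch only; `is_valid` is a hypothetical stand-in for the question's actual filter):

```python
import os

def map_tickstore(tickstore, is_valid=lambda name: True):
    """Single os.walk pass mapping each directory to its valid files."""
    mapping = {}
    for dirpath, dirnames, filenames in os.walk(tickstore):
        # os.walk already yields the file names, so no extra os.listdir call
        mapping[dirpath] = [f for f in filenames if is_valid(f)]
    return mapping
```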

3 Answers


Yes: use Python 3.5 (which is currently still an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.

This work was done as part of PEP 471.

Extracted from the PEP:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
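On Python 3.5+ the same machinery is also exposed directly as `os.scandir()`; a minimal sketch (not from the PEP text) of listing subdirectories without a `stat()` per entry:

```python
import os

def list_subdirs(path):
    """Immediate subdirectories of path via os.scandir (Python 3.5+).
    DirEntry.is_dir() normally reuses type info from the directory
    listing instead of issuing a separate stat() call."""
    return [entry.path for entry in os.scandir(path)
            if entry.is_dir(follow_symlinks=False)]
```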

Thomas Orozco
    (1) you could use [`scandir` library](https://github.com/benhoyt/scandir) on earlier Python versions (2) [don't expect performance improvement on POSIX systems (compared to `os.fwalk()`)](http://bugs.python.org/issue22524). Measure it. – jfs Sep 08 '15 at 10:42

One way to optimize it in Python 2.7 is to use scandir.walk() instead of os.walk(); the parameters are exactly the same.

import scandir
directory = "/tmp"
res = scandir.walk(directory)
for item in res:
    print item

PS: As @rcoup mentioned in the comments, scandir needs to be installed before use in Python 2.7.
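A version-tolerant variant (a sketch, not from the answer itself) falls back to os.walk when the scandir package isn't installed; on Python 3.5+, os.walk is scandir-based anyway:

```python
try:
    from scandir import walk  # third-party package, for Python 2.7
except ImportError:
    from os import walk  # Python 3.5+: os.walk already uses scandir

def find_children(tickstore):
    # same output as the question's function: every directory path
    return [dirpath for dirpath, dirnames, filenames in walk(tickstore)]
```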

buxizhizhoum
  • [scandir needs installing via PyPI](https://pypi.python.org/pypi/scandir) in 2.7, it's only in the standard library from Python 3.5. See the proposal and how it works at https://www.python.org/dev/peps/pep-0471/ – rcoup Jul 28 '17 at 10:01
  • yes, thanks for the reminder; scandir needs to be installed with `pip install scandir` before import. – buxizhizhoum

os.walk is currently quite slow because it first lists the directory and then does a stat on each entry to see if it is a directory or a file.

An improvement is proposed in PEP 471 and should be coming soon in Python 3.5. In the meantime, you could use the scandir package to get the same benefits in Python 2.7.

strubbly