23

I am trying to improve the performance of elFinder, an AJAX-based file manager (elRTE.ru).

It uses os.listdir recursively to walk through all directories, and this takes a performance hit (listing a directory tree with 3000+ files takes 7 seconds).

I am trying to improve its performance. Here is its walking function:

        # inside self.__tree(path): recurse into every accepted subdirectory
        for d in os.listdir(path):
            pd = os.path.join(path, d)
            # skip symlinks and names rejected by the filter, then recurse
            if os.path.isdir(pd) and not os.path.islink(pd) and self.__isAccepted(d):
                tree['dirs'].append(self.__tree(pd))

My questions are:

  1. If I use os.walk instead of os.listdir, would it improve performance? (A sketch of that rewrite follows this list.)
  2. How about using dircache.listdir()? Cache the WHOLE directory/subdirectory contents at the initial request and return the cached results if no new files have been uploaded and nothing has changed?
  3. Is there any other method of directory walking that is faster?
  4. Is there any other fast server-side file browser written in Python (though I would prefer to make this one fast)?
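
A minimal sketch of the os.walk rewrite from question 1, with in-place pruning standing in for the original checks; `is_accepted` here is a hypothetical stand-in for `self.__isAccepted`:

    import os

    def walk_dirs(path, is_accepted):
        """Walk `path` top-down, descending only into accepted directories."""
        for root, dirs, files in os.walk(path):
            # prune in place: os.walk will not descend into removed entries
            # (os.walk already avoids symlinked dirs unless followlinks=True)
            dirs[:] = [d for d in dirs
                       if is_accepted(d)
                       and not os.path.islink(os.path.join(root, d))]
            yield root, dirs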
Phyo Arkar Lwin
  • What are you using this data for? If you can afford to do the recursion lazily (only call `os.listdir()` when you need the contents, not automatically when you find a new directory), then you can amortize the cost over lots of requests. That's how most file managers work in practice. – Daniel Pryden Jul 01 '10 at 23:12
  • This data is for an AJAX-based file manager called elFinder, from elrte.ru. It is a nice one, but the problem is that it is too slow due to the function I pasted. Your approach looks practical; I will change it to list each directory on demand instead of walking the whole tree recursively. – Phyo Arkar Lwin Jul 01 '10 at 23:29
  • `os.walk()` will not be faster than your walking function because they do mostly the same things. `os.walk()` uses `os.listdir()`, `os.path.isdir()`, etc. Check the code of `os.walk()` and you will see! – Etienne Jul 02 '10 at 01:59
  • **2017 update**: A lot of the information here is outdated now. Namely, `os.walk` no longer uses `listdir`; it now uses the faster [`scandir`](https://www.python.org/dev/peps/pep-0471/). – wim Feb 22 '17 at 21:41
  • @wim In which version did it start using that? It's not available in 2.7, right? – Phyo Arkar Lwin Feb 25 '17 at 20:47

10 Answers

27

I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread out within around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.

Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory and sort the directory list by inode number. This makes the walk visit the subdirectories mostly in inode order, which reduces disk seeks.

For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
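
A minimal sketch of the inode-sorting trick described above, assuming a Linux/ext3 setup where inode order approximates on-disk layout; `walk_inode_order` is a hypothetical name:

    import os

    def walk_inode_order(top):
        for root, dirs, files in os.walk(top, topdown=True):
            # sorting `dirs` in place changes the order in which os.walk
            # visits subdirectories, so they are read roughly in inode order
            dirs.sort(key=lambda d: os.stat(os.path.join(root, d)).st_ino)
            yield root, dirs, files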

garlon4
  • Actually this is the fastest way. Thanks, but an answer has already been chosen. – Phyo Arkar Lwin Aug 08 '13 at 13:35
  • Nice trick garlon4, it is hard to think of it this way without your hint. And @V3ss0n, I think you can still change your chosen answer at any time, if you want to. – RayLuo Oct 19 '13 at 14:10
  • If performance is key, you don't need portability because you are on Linux, and your list is "static" (you don't get new files frequently), then I'd consider running an external process using a native command like 'find', or even a small C program like https://stackoverflow.com/questions/4204666/how-to-list-files-in-a-directory-in-a-c-program, writing the output to a file, and then reading that file from Python. This is a MUCH faster solution in the given scenario. – miguelfg Jun 06 '17 at 16:17
  • In my use case, I explicitly didn't know if files and directories had changed or not (which is why I was walking them). – garlon4 Jun 09 '17 at 18:42
17

Did you check out scandir (previously betterwalk)? I have not tried it myself, but there's a discussion about it here and another one here. It claims a speedup of 3~10x on Mac OS X/Linux and 7~50x on Windows by avoiding redundant calls to os.stat(). It's also included in the standard library as of Python 3.5.

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling listdir() on each directory -- it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X.

From the project's readme.
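
A minimal sketch of the recursive pattern the readme describes, runnable on Python 3.5+ where os.scandir is in the standard library; `scantree` is a hypothetical name:

    import os

    def scantree(path):
        """Recursively yield file paths without redundant stat() calls."""
        for entry in os.scandir(path):
            # entry.is_dir() reuses type information the OS already returned
            # with the directory listing, avoiding an extra stat() per entry
            if entry.is_dir(follow_symlinks=False):
                yield from scantree(entry.path)
            else:
                yield entry.path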

gaborous
  • I am going to try it, that's nice. – Phyo Arkar Lwin Nov 16 '15 at 08:50
  • `scandir` is included in Python 3.5's `os` module. – funky-future May 21 '16 at 21:48
  • **Note**: `scandir` is now included in Python, and [**it's actually used by** `os.walk`](https://github.com/python/cpython/blob/cb41b2766de646435743b6af7dd152751b54e73f/Lib/os.py#L348). So if you were thinking of trying `scandir` as a faster replacement, you can forget about that! – wim Feb 22 '17 at 21:39
  • @gaborous, are there any benchmarks for how long it takes on ~1M files? – Vass Apr 02 '22 at 14:51
5

You should measure directly on the machines (OSs, filesystems and caches thereof, etc) of your specific interest -- whether or not os.walk is faster than os.listdir on a specific and totally different machine / OS / FS will tell you very little about performance on yours.

Not sure what you mean by cachedir.listdir -- there is no standard library module / function by that name. listdir already reads the whole directory in at one gulp (as it must sort the results), as does os.walk (as it must separate subdirectories from files). If, depending on your platform, you have a fast way of being notified about file/directory changes, then it's probably worth building the tree up once and editing it incrementally as change notifications come in... but it depends on the relative frequency of changes vs requests, which is, again, totally dependent on your specific application circumstances.
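
A minimal sketch of the build-once-and-invalidate idea, assuming directory mtime is an adequate change signal on your platform (it changes when entries are added or removed, not when file contents change); `cached_listdir` is a hypothetical name:

    import os

    _cache = {}  # path -> (mtime at last read, cached entries)

    def cached_listdir(path):
        mtime = os.stat(path).st_mtime
        hit = _cache.get(path)
        if hit is not None and hit[0] == mtime:
            return hit[1]  # directory unchanged since the last read
        entries = sorted(os.listdir(path))
        _cache[path] = (mtime, entries)
        return entries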

Alex Martelli
3

In order:

  • I doubt you'll see much of a speed-up between os.walk and os.listdir, since both rely on the underlying filesystem. In fact, I suspect the underlying filesystem is going to have a big effect on the speed of the operation.

  • Any cache operation is going to be significantly faster than hitting the filesystem (at least for the second and subsequent checks).

  • You could always write some utility (or call a shell command) which generates the list of directories outside of Python, and call it through the subprocess module (see the sketch after this list). But that's a little complicated, and I'd turn to that solution only if the cache turned out not to work for you.

  • If you haven't located a file browser on the Cheeseshop, you probably won't find one.
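
A hedged sketch of the external-utility idea from the third bullet, assuming a POSIX system with `find` on the PATH; `list_dirs` is a hypothetical name:

    import subprocess

    def list_dirs(root):
        # let the native `find` binary enumerate directories; this is not
        # portable to Windows and bypasses any Python-level filtering
        out = subprocess.run(["find", root, "-type", "d"],
                             capture_output=True, text=True, check=True)
        return out.stdout.splitlines()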

Chris B.
2

I was looking for a way to count the images inside a folder, but Colab was timing out with os.listdir() after running for several minutes. The fast way was to create an iterator with os.scandir(), then fill the filenames into a separate list. It works in seconds.

This answer is similar to the others, but I am including the code as an alternative and pointing out that Colab is problematic with large directories.

import os

img_dir = data_folder  # data_folder is the image folder path, set earlier
obj = os.scandir(img_dir)

# List all files and directories in the specified path
print("Files and Directories in '%s':" % img_dir)

img_files = []
for entry in obj:
    if entry.is_dir() or entry.is_file():
        img_files.append(entry.name)

len(img_files)
  • os.listdir() is a blocking operation; it will block the thread, so it is a very bad idea to run it on Google Colab. The best way is to run it in a separate thread. AIOFiles does not have support for listing directories either, and os.scandir() is also a blocking operation. See here: https://stackoverflow.com/questions/23894515/how-do-i-list-files-in-asyncio – Phyo Arkar Lwin Sep 21 '21 at 13:14
2

Funny thing: the discussion about whether os.walk or os.listdir is faster led me to this documentation:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

I guess it answers that :)

geoai777
1

How about doing it in bash?

import subprocess

command = 'ls .... or something else'  # placeholder left as in the original
# with shell=True the command must be a single string, not a list
subprocess.Popen(command, shell=True)

In my case, which was changing permissions on thousands of files, this has worked much better.

zzart
  • Parsing command lines is not pythonic and is hacky; I avoid calling command lines when the functionality is available via Python, and it is not portable. In my case, if I need SSH access to a target I use paramiko, never an ssh client. – Phyo Arkar Lwin Aug 08 '13 at 13:37
  • It's not portable - agreed. I never mentioned anything about ssh. If calling command lines is 'not pythonic and hacky', why is the subprocess module included in Python? Anyway, I've suggested native bash, which is much faster at traversing dir trees than any Python. – zzart Aug 17 '13 at 20:42
  • "Native bash which is much faster traversing": I would like to see a performance benchmark for that claim. "Not pythonic and hacky, why is subprocess module included in Python": I am sure you are not familiar with what pythonic means. – Phyo Arkar Lwin Aug 23 '13 at 13:21
  • I was actually considering this idea for just regular directory listings... not walking trees. `find` might be a fast solution for walking trees. True, it is not pythonic, but if it is much faster, sometimes we have to make do. Whenever I need to md5 something, I use subprocess because, by the same principle, it is much faster than running it in native Python. – Sean DiZazzo Apr 05 '14 at 08:00
1

I know this is an old thread, but I just had to make the same decision now, so I am posting the results. With all the updates in Python 3.5+, os.walk() is the fastest way to do this, compared to os.listdir() and os.scandir().

I was collecting files within two master folders and about 30 folders in each master folder.

files_list = [os.path.join(root, f)  # root already starts with dir_
              for dir_ in folder_list
              for root, dirs, files in os.walk(dir_)
              for f in files  # each f is already a bare name, no basename needed
              if (f.startswith(prefix) and f.endswith(ext))]

Results of my tests:
os.scandir(): 10,949 files, 35.579052 seconds
os.listdir(): 10,949 files, 35.197001 seconds
os.walk(): 10,949 files, 01.544174 seconds

Bish
  • There is obviously something going on here; os.walk cannot be that fast. It probably leveraged some kind of OS caching to achieve such a fast result, so this result is likely not representative of all situations. Anyway, Python 3.5+ optimized all the file-walking functions, so the redundant calls are likely not an issue anymore. – gaborous Apr 09 '22 at 12:49
0

You are looking for fsdir. It's written in C and made to work with Python. It is much faster than walking the tree with the standard Python libraries.

Sean DiZazzo
0

os.path.walk may increase your performance, for two reasons:

1) If you can stop walking before you've walked everything, then indeed it will be faster than listdir, although the difference is only noticeable when dealing with large trees

2) If you're listing HUGE directories, then it can be expensive to build the list returned by listdir. (Not true, see Alex's comment below)

However, it probably won't make a difference and may in fact be slower, due to the potential extra overhead incurred by calling your visit function and doing all the extra argument packing and unpacking.

(Really the only way to answer this question is to test it yourself - it should only take a few minutes)
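
For reference, a minimal sketch of the visit-function style this answer describes, valid on Python 2 only (os.path.walk was removed in Python 3); pruning `names` in place is what lets a walk stop early:

    import os.path

    def visit(arg, dirname, names):
        arg.append(dirname)
        # removing entries from `names` in place prevents descent into them,
        # which is how an os.path.walk traversal can be cut short
        names[:] = [n for n in names if not n.startswith('.')]

    dirs_seen = []
    os.path.walk('/some/path', visit, dirs_seen)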

Nick Bastin
  • Both the relatively-new os.walk and the old-and-crusty os.path.walk necessarily read each directory entirely because they must present the names in it as one or two lists (os.path.walk is specified in the docs as using os.listdir, but how do you think os.walk does it?-). So (2) doesn't really apply. – Alex Martelli Jul 01 '10 at 22:30
  • Well, phooey. I still stand by my admonition that one should test these things.. :-) – Nick Bastin Jul 01 '10 at 22:44
  • So that means there's no performance difference. But at least with os.walk I won't need to be doing os.path.isdir(pd) and not os.path.islink(pd), as it gives out files and dirs separately, right? Alright, I am going to test it and let you know! – Phyo Arkar Lwin Jul 01 '10 at 23:42