
I recently asked this question and got a wonderful answer involving `os.walk`. My script uses it to search an entire drive for a specific folder with `for root, dirs, files in os.walk(drive):`. Unfortunately, on a 600 GB drive, this takes about 10 minutes.

Is there a better way to invoke this, or a more efficient function to use? Thanks!
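For reference, the kind of loop described above might look like this (a minimal sketch; the function name and early-return behavior are assumptions, not the asker's actual code):

```python
import os

def find_folder(drive, target):
    """Walk the whole drive and return the first directory named `target`,
    or None if it isn't found. This visits every directory, hence the cost."""
    for root, dirs, files in os.walk(drive):
        if target in dirs:
            return os.path.join(root, target)
    return None
```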

wnnmaw
    That's about how long it takes to go through a 600GB drive. If you want better performance, you should think about a strategy that is not brute-force searching the entire drive. – roippi Aug 08 '13 at 20:57
  • The only way to make this any faster would be (a) maintain an index of folders so you don't have to search, (b) access an index maintained by your OS (like Windows Desktop Search or Spotlight), or (c) access an index maintained by the filesystem (which will require calling low-level functions that aren't wrapped by Python, via `ctypes` or similar, and may require root/admin privileges, if it's even possible). – abarnert Aug 08 '13 at 21:01

3 Answers


If you're just looking for a small constant improvement, there are ways to do better than `os.walk` on most platforms.

In particular, `walk` ends up having to `stat` many regular files just to make sure they're not directories, even though that information is (on Windows) or could be (on most *nix systems) already available from the lower-level APIs. Unfortunately, it isn't exposed at the Python level, but you can get to it via `ctypes`, by building a C extension, or by using a third-party module like `scandir`.

This may cut your time by anywhere from 10% to 90%, depending on your platform and the details of your directory layout. But it's still a linear search that has to check every directory on your system. The only way to do better than that is to access some kind of index. Your platform may have one (e.g., Windows Desktop Search or Spotlight); your filesystem may as well (though that will require low-level calls and may need root/admin access); or you can build one of your own.
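A minimal sketch of the approach this answer describes, assuming Python 3.5+ where `os.scandir` exposes the same cached directory-type information the third-party `scandir` module provides (the helper name is hypothetical):

```python
import os

def walk_dirs(path):
    """Yield every directory path under `path` without extra stat() calls.
    On most platforms, entry.is_dir() reuses type info already fetched
    by the directory listing, so no second syscall per entry is needed."""
    try:
        entries = list(os.scandir(path))
    except OSError:
        return  # unreadable directory; skip it
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            yield entry.path
            yield from walk_dirs(entry.path)
```

It's still a linear scan of every directory, just with less per-entry overhead.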

abarnert
  • So, in short, there's no easy way to do it with `os.walk`, and instead, I'll have to use the third party `scandir`? – wnnmaw Aug 09 '13 at 16:39
  • @wnnmaw: If by "do it" you mean "get a small constant improvement" then yes, `os.walk` can't go any faster than `os.walk`, but `scandir` or other third-party libs can. If you want a more radical improvement, you'll need to radically change what you're doing. – abarnert Aug 09 '13 at 17:34

Use `subprocess.Popen` to start a native `find` process.
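A sketch of what that might look like on a Unix system (the helper name is an assumption; as the comments below note, this is still a linear scan and is not portable to Windows):

```python
import subprocess

def find_dirs_native(drive, folder_name):
    """Shell out to the Unix `find` utility to locate directories by name.
    Returns a list of matching paths."""
    proc = subprocess.Popen(
        ["find", drive, "-type", "d", "-name", folder_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # ignore permission-denied noise
    )
    out, _ = proc.communicate()
    return out.decode().splitlines()
```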

Scruffy
  • `find` is effectively doing the same thing as `os.walk`. It can be faster by a constant factor (because it doesn't have to stat as many files), but it's still walking everything. And of course it only works on Unix. – abarnert Aug 08 '13 at 21:07
  • On Windows, that would be `out, err = subprocess.Popen('dir /s /ad c:\myfolder', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()` – tdelaney Aug 08 '13 at 21:07

`scandir.walk(path)` gives results 2-20 times faster than `os.walk(path)`. You can install the module with `pip install scandir`; its documentation covers the details.
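A minimal sketch of the drop-in swap (falling back to `os.walk` where the third-party module isn't installed; on Python 3.5+, `os.walk` already uses the scandir approach internally, so the fallback is just as fast there):

```python
try:
    from scandir import walk  # third-party module: pip install scandir
except ImportError:
    from os import walk  # Python 3.5+ absorbed scandir's speedup

def count_dirs(path):
    """Count directories under `path` using whichever walk is available."""
    total = 0
    for root, dirs, files in walk(path):
        total += len(dirs)
    return total
```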

ShivaGuntuku