
I am looking for a way to iterate through a directory containing hundreds of thousands of files. os.listdir is terribly slow here, because it builds the full list of entries for the whole directory before returning anything.

What are the fastest options?

NOTE: whoever downvoted has never faced this situation for sure.

jldupont
  • http://stackoverflow.com/questions/120656/directory-listing-in-python – squiguy Aug 30 '12 at 22:51
  • possible duplicate of [List files in a folder as a stream to begin process immediately](http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately) – Nemo Aug 30 '12 at 22:54
  • @squiguy: the question you refer to is not the same as what I am after. – jldupont Aug 30 '12 at 22:54
  • How fast does `ls -U` start returning results? By not needing to sort the files, it may be able to feed them to you via a subprocess pipe (see the sketch after these comments). – John La Rooy Aug 30 '12 at 23:10
  • possible duplicate of [partial directory listing](http://stackoverflow.com/questions/12170157/partial-directory-listing) – unutbu Aug 31 '12 at 01:12

2 Answers


This other question was referred to in the comments as a duplicate:
List files in a folder as a stream to begin process immediately

...but I found its example to be only partially working. Here is the fixed version that works for me:

from ctypes import CDLL, c_int, c_uint8, c_uint16, c_uint32, c_char, c_char_p, Structure, POINTER
from ctypes.util import find_library

import os

class c_dir(Structure):
    pass

class c_dirent(Structure):
    # note: struct dirent layout is platform-specific; this one matches BSD/OS X
    _fields_ = [
        ("d_fileno", c_uint32), 
        ("d_reclen", c_uint16),
        ("d_type", c_uint8), 
        ("d_namlen", c_uint8),
        ("d_name", c_char * 4096),
        # proper way of getting platform MAX filename size?
        # ("d_name", c_char * (os.pathconf('.', 'PC_NAME_MAX')+1) ) 
    ]

c_dirent_p = POINTER(c_dirent)
c_dir_p = POINTER(c_dir)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    dir_p = opendir(path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name
            if name not in (".", ".."):
                yield name
    finally:
        closedir(dir_p)


if __name__ == "__main__":
    for name in listdir("."):
        print name
jdi

What are you doing to each file in the directory? I don't think there is really any way around os.listdir, but depending on what you are doing, you might be able to process the files in parallel. For example, you could use the Pool from the multiprocessing library to spawn additional Python processes and have each process iterate over a smaller subset of the files.

http://docs.python.org/library/multiprocessing.html

This is kind of rough, but I think it gets the point across...

import os
from multiprocessing import Pool

def work(subset_of_files):
    for path in subset_of_files:
        with open(path, 'r') as f:
            pass  # read the file and do the real work here
    return "data"

if __name__ == "__main__":
    p = Pool(3)
    # placeholders: each subset is one slice of the full list of file paths
    p.map(work, [subset_1, subset_2, subset_3])

The general idea is to get the list of files from os.listdir, but instead of walking over 100,000 files one by one, you split them into, say, 20 lists of 5,000 files and hand 5,000 files to each worker process (a sketch of that splitting step follows). A nice side effect of this approach is that it benefits from today's multi-core machines.
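
As a rough illustration of that splitting step, here is a minimal sketch. The chunk size, the worker count, and the body of `work` are assumptions for the example, not details from the answer above.

import os
from multiprocessing import Pool

def chunks(items, size):
    # split a list into consecutive sublists of at most `size` items
    return [items[i:i + size] for i in range(0, len(items), size)]

def work(subset_of_files):
    for path in subset_of_files:
        pass  # open/read the file and do the real work here
    return len(subset_of_files)

if __name__ == "__main__":
    names = os.listdir(".")        # still pays the cost of the full listing
    subsets = chunks(names, 5000)  # e.g. 100,000 names -> 20 subsets
    pool = Pool(4)                 # assumed worker count
    print pool.map(work, subsets)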

Wulfram
  • I think the OP's issue is that the call to `os.listdir` itself takes a long time because of the number of items in that directory. So in this case, the map wouldn't start until that entire list had been acquired. – jdi Aug 31 '12 at 00:34
  • Thanks, I misread the question a little. I think even in that case, you could use the approach I outlined above. Instead of getting the list of files all at once and then dividing it for the worker processes, you could have each worker process grab an equal subset of the files in the directory (maybe via a direct shell call). I just believe that when we are talking about 100,000s of files, divide and conquer is a good approach, and you would do this via processes because of the global interpreter lock. – Wulfram Aug 31 '12 at 08:01
  • Disk IO isn't usually a problem with the GIL, so threads are still fine, I am sure. The GIL wouldn't be held during a blocking system call. But even with the divide-and-conquer approach... how do you split up the files in the directory ahead of time? No matter what, a directory listing has to occur, which again is the holdup. What you do with the paths in terms of work is really the second step. – jdi Aug 31 '12 at 16:08