1

I have a huge directory that is updated constantly. I am trying to list only the latest 100 files in the directory using Python. I tried os.listdir(), but when the directory approaches 100,000 files, listdir() seems to crash (or I have not waited long enough). I only need the first 100 files (or filenames) for further processing, so I don't want listdir() to build a list of all 100,000 files. Is there a good way of doing this in Python?

PS: I am very new to programming

Craig S. Anderson
Prashant
  • 1
    how are you deciding what are the *latest* hundred? – Padraic Cunningham Jul 15 '15 at 09:30
  • 1
    From what I can find (http://stackoverflow.com/questions/168409/how-do-you-get-a-directory-listing-sorted-by-creation-date-in-python), all the listed methods use `os.listdir` in some way, so it will take longer and longer as your directory grows in size/number of files. Perhaps a better approach would be a directory watch that looks for new files/file updates and acts on individual files, but this would have to be a permanently running process. See https://pypi.python.org/pypi/watchdog for tips on how to do this – James Kent Jul 15 '15 at 09:54
  • You said OS, but not which one. Or is this supposed to be cross platform? – Dalen Jul 15 '15 at 10:05
  • I am using Windows. To decide on the latest files, I found this: http://stackoverflow.com/questions/11259273/find-files-folders-that-are-modified-after-a-specific-date-in-python. If I am willing to wait long enough for os.listdir() to be populated, I can use os.stat().st_mtime on each file, but I don't want to wait for os.listdir() to be filled – Prashant Jul 15 '15 at 10:55
  • How about using subprocess to run a dir command to find n most recent files? – Padraic Cunningham Jul 15 '15 at 11:43
  • @PadraicCunningham Thanks for the suggestion. I tried it and it works. But it turns out that this method is just as slow as os.listdir(), because I cannot start processing the files as soon as they are loaded. I stumbled across this: https://www.olark.com/developers-corner/you-can-list-a-directory-with-8-million-files-but-not-with-ls – Prashant Jul 15 '15 at 14:15
  • @Prashant, did you get it working for you? – Padraic Cunningham Jul 15 '15 at 14:42
  • @PadraicCunningham I kind of did, but from the looks of it, it is very much like the answer given by shravan, only instead of os.listdir('path') I am using subprocess.check_output(). I couldn't figure out how to make it stop after getting the first 100 results. I don't know much about the subprocess module, so the efficiency of the code I have written is similar to os.listdir('path')[:100]. As suggested by Dalen, I could try to start processing the files as they come in to speed up the process, but I could not understand the code – Prashant Jul 15 '15 at 16:28

2 Answers

2

Here is your answer on how to traverse a large directory file by file!

I searched like a maniac for a Windows DLL that will allow me to do what is done on Linux, but no luck.

So I concluded that the only way was to create my own DLL that would expose those functions to me, but then I remembered pywintypes (part of the pywin32 package). And, yay, this is already done there. Even better, an iterator function is already implemented! Cool!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So I used pywintypes.

EDIT: I discovered (much later) that these functions are available from kernel32.dll. They were hiding right under my nose the whole time.

Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies, and distribute it with your program independently of pywin32 if you have to.

As you will see from the speed tests returning a generator is very fast.

After this, you will be able to go file by file and do whatever you want.
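To illustrate what "file by file" buys you, here is a minimal sketch that stops after the first 100 names. It assumes Python 3.6 or later and uses the standard library's os.scandir() as a stand-in for the listdir() generator in this answer; the helper name first_names is made up for the example.

```python
import os
from itertools import islice

def first_names(path, limit=100):
    """Return up to `limit` entry names without reading the whole directory."""
    # os.scandir() yields entries lazily; islice stops pulling after `limit` items
    with os.scandir(path) as entries:
        return [entry.name for entry in islice(entries, limit)]
```

The same islice() call should work on any generator of names, so the listdir() defined in this answer could be substituted for os.scandir() on Python 2. Note that these are simply the first entries the operating system hands back, not the newest ones.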

NOTE: win32file.FindFilesIterator() returns the whole stat of the file/dir. Therefore, using my listdir() to get the name and then calling os.path.get*time() or os.path.is*() afterwards doesn't make sense. It is better to modify my listdir() to do those checks.

Now, getting a full solution for your problem is still problematic.

The bad news is that iteration starts at whichever item in the directory it likes, and you cannot choose which one it will be. In my tests (on Windows) it always returned the directory sorted.
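Because the iteration order cannot be chosen, finding the latest files still means visiting every entry, but it does not require holding all of them in memory. The following is a rough sketch under stated assumptions: Python 3.6 or later, os.scandir() in place of this answer's listdir(), "latest" meaning the largest modification time, and a made-up helper name newest_files.

```python
import heapq
import os

def _mtimes(entries):
    """Yield (mtime, name) for regular files, skipping entries that vanish."""
    for entry in entries:
        try:
            if entry.is_file():
                yield entry.stat().st_mtime, entry.name
        except OSError:
            continue  # file was removed or is inaccessible; ignore it

def newest_files(path, limit=100):
    """Return the names of the `limit` most recently modified files, newest first."""
    with os.scandir(path) as entries:
        # nlargest consumes the generator while keeping only `limit` items
        newest = heapq.nlargest(limit, _mtimes(entries))
    return [name for _mtime, name in newest]
```

On Windows, entry.stat() normally reuses the information already returned by the directory scan, so this avoids a separate os.stat() call per file. The whole directory is still traversed once.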

The half-good news is that on Windows you can use wildcards to control which files you list. So, to use this on a constantly filling directory, you can mark newly arriving files with a version and do something like:

from time import sleep
from listdir import listdir  # the module listed below
bunch = 1
while True:
    for name in listdir("mydir\\*bunch%i*" % bunch):
        print(name)
    sleep(5)
    bunch += 1

But you'll have to design this very carefully, otherwise files that arrive late will be missed.

I don't know whether FindFilesIterator() will continue detecting new files as they arrive if you introduce a delay between loop iterations.

If it did, this may also be your solution.

You can always make an iterator in advance and then call the next() method to get the next file:

i = listdir(".")
while True:
    try: name = next(i)
    except StopIteration: sleep(1)
# This probably won't work as imagined though:
# once a generator raises StopIteration it stays exhausted

You can decide how long to wait for new files based on the size of the most recently arrived files. This is a wild guess that all incoming files will be roughly the same size, plus or minus something.

However, win32file offers some functions that can help you monitor the directory for changes, and I think that this is your best bet.
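For comparison, monitoring can also be approximated by plain polling. The sketch below is not the win32file change-notification approach mentioned above; it is a simpler, portable substitute that rescans the directory and reports names it has not seen before. It assumes Python 3.6 or later, and the helper name poll_new_names is invented for the example.

```python
import os

def poll_new_names(path, seen):
    """Return names in `path` that are not yet in `seen`, and record them."""
    with os.scandir(path) as entries:
        fresh = [entry.name for entry in entries if entry.name not in seen]
    seen.update(fresh)
    return fresh

# Typical use (left as comments so nothing runs forever):
#   seen = set()
#   while True:
#       for name in poll_new_names("mydir", seen):
#           ...  # process the newly arrived file
#       time.sleep(5)
```

Each poll still walks the whole directory, so this trades efficiency for simplicity. A real change-notification API, or the watchdog package mentioned in the comments on the question, avoids the repeated scans.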

In the speed tests you can also see that constructing a list from this iterator is slower than calling os.listdir(), but os.listdir() will block, whereas my listdir() will not. Its purpose is not to create lists of files anyway. I don't know why this speed loss appears. I can only guess it has something to do with DLL calls, list construction, sorting or the like. os.listdir() is written completely in C.

You can see some usage examples in the if __name__ == "__main__" block. Save the code in listdir.py and 'from listdir import *' it.

Here is the code:


#! /usr/bin/env python

"""
An equivalent of os.listdir() but as a generator using ctypes on 
Unixoides and pywintypes on Windows.

On Linux there is shared object libc.so that contains file manipulation 
functions we need: opendir(), readdir() and closedir().
On Windows those manipulation functions are provided 
by static library header windows.h. As pywintypes is a wrapper around 
this API we will use it.
kernel32.dll contains FindFirstFile(), FindNextFile() and FindClose() as well and they can be used directly via ctypes.

The Unix version of this code is an adaptation of code provided by user
'jason-orendorff' on Stack Overflow answering a question by user 'adrien'.
The original URL is:
http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately

The Unix code is tested on Raspbian for now and it works. A reasonable 
conclusion is that it'll work on all Debian based distros as well.

NOTE: dirent structure is not the same on all distros, so the code will break on some of them.

The code is also tested on Cygwin using cygwin1.dll and it 
doesn't work.

If platform isn't Windows or Posix environment, listdir will be 
redirected back to os.listdir().

NOTE: There is scandir module implementing this code with no dependencies, excellent error handling and portability. I found it only after putting together this code. scandir() is now included in standardlib of Python 3.5 as os.scandir().
You definitely should use scandir, not this code.
Scandir module is available on pypi.python.org.
"""

import sys, os

__all__ = ["listdir"]

if sys.platform.startswith("win"):
    from win32file import FindFilesIterator

    def listdir (path):
        """
        A generator to return the names of files in the directory passed in
        """
        if "*" not in path and "?" not in path:
            st = os.stat(path) # Raise an error if dir doesn't exist or access is denied to us
            # Check if we got a dir or something else!
            # Check gotten from stat.py (for fast checking):
            if (st.st_mode & 0170000) != 0040000:
                e = OSError()
                e.errno = 20; e.filename = path; e.strerror = "Not a directory"
                raise e
            path = path.rstrip("\\/")+"\\*"
        # Else:  Decide that user knows what she/he is doing
        for file in FindFilesIterator(path):
            name = file[-2]
            # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
            if name=="." or name=="..": continue
            yield name

elif os.name=="posix":
    if not sys.platform.startswith("linux"):
        print >> sys.stderr, "WARNING: Environment is Unix but platform is '"+sys.platform+"'\nlistdir() may not work properly."
    from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
    from ctypes.util import find_library

    class c_dir(Structure):
        """Opaque type for directory entries, corresponds to struct DIR"""
        pass

    c_dir_p = POINTER(c_dir)

    class c_dirent(Structure):
        """Directory entry"""
        # FIXME not sure these are the exactly correct types!
        _fields_ = (
            ('d_ino', c_long), # inode number
            ('d_off', c_long), # offset to the next dirent
            ('d_reclen', c_ushort), # length of this record
            ('d_type', c_byte), # type of file; not supported by all file system types
            ('d_name', c_char * 4096) # filename
            )

    c_dirent_p = POINTER(c_dirent)

    c_lib = CDLL(find_library("c"))
    # Extract functions:
    opendir = c_lib.opendir
    opendir.argtypes = [c_char_p]
    opendir.restype = c_dir_p

    readdir = c_lib.readdir
    readdir.argtypes = [c_dir_p]
    readdir.restype = c_dirent_p

    closedir = c_lib.closedir
    closedir.argtypes = [c_dir_p]
    closedir.restype = c_int

    def listdir(path):
        """
        A generator to return the names of files in the directory passed in
        """
        st = os.stat(path) # Raise an error if path doesn't exist or we don't have permission to access it
        # Check if we got a dir or something else!
        # Check gotten from stat.py (for fast checking):
        if (st.st_mode & 0170000) != 0040000:
            e = OSError()
            e.errno = 20; e.filename = path; e.strerror = "Not a directory"
            raise e
        dir_p = opendir(path)
        try:
            while True:
                p = readdir(dir_p)
                if not p: break # End of directory
                name = p.contents.d_name
                if name!="." and name!="..": yield name
        finally: closedir(dir_p)

else:
    print >> sys.stderr, "WARNING: Platform is '"+sys.platform+"'!\nFalling back to os.listdir(), iterator generator will not be returned!"
    listdir = os.listdir

if __name__ == "__main__":
    print
    if len(sys.argv)!=1:
        try: limit = int(sys.argv[2])
        except (IndexError, ValueError): limit = -1 # No usable limit given; list everything
        count = 0
        for name in listdir(sys.argv[1]):
            if count==limit: break
            count += 1
            print repr(name),
        print "\nListed", count, "items from directory '%s'" % sys.argv[1]
    if len(sys.argv)!=1: sys.exit()
    from timeit import *
    print "Speed test:"
    dir = ("/etc", r"C:\WINDOWS\system32")[sys.platform.startswith("win")]
    t = Timer("l = listdir(%s)" % repr(dir), "from listdir import listdir")
    print "Measuring time required to create an iterator to list a directory:"
    time = t.timeit(200)
    print "Time required to return a generator for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = os.listdir(%s)" % repr(dir), "import os")
    print "Measuring time required to create a list of directory in advance using os.listdir():"
    time = t.timeit(200)
    print "Time required to return a list for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = []\nfor file in listdir(%s): l.append(file)" % repr(dir), "from listdir import listdir")
    print "Measuring time needed to create a list of directory using our listdir() instead of os.listdir():"
    time = t.timeit(200)
    print "Time required to create a list for directory '"+dir+"' using our listdir() instead of os.listdir() is", time, "seconds measured through 200 passes"

Dalen
  • Hi Dalen, thanks a ton for all the effort to find a solution to this. I'm a total newbie, so I don't understand much of what you've said, but I'll work through it and use it. Again, thanks a lot – Prashant Sep 09 '15 at 14:11
  • @Prashant : Don't worry, we all were newbies at some time and at some thing. I edited the post, found a tiny bug upon rereading the code. Ask freely about anything you don't understand. That is how you stop being newbie at something. – Dalen Sep 09 '15 at 23:05
  • If you're wondering about pywin32, well, it's a library that eases access to Windows APIs through Python. You can get it at: http://sourceforge.net/projects/pywin32/ – Dalen Sep 09 '15 at 23:07
  • Thank you so much @Dalen ! 7 years later and your code still works like a charm. It really saved me from a lot of trouble (I have to work in a directory with ~ 10 *million* files) – Ph.lpp Jul 20 '22 at 10:43
  • @Ph.lpp : Why shouldn't it work? As long as you use Python 2 you should be fine. But I strongly advise to use Python 3 and os.scandir() if possible, the scandir module otherwise. – Dalen Jul 20 '22 at 18:56
  • @Dalen Naturally, I ported your code to Python 3. Thank you for the hint, I will look into os.scandir(). – Ph.lpp Jul 21 '22 at 10:09
-1

You may try to read the directory directly (as a file) and pick the data out from there. How successful this would be depends on the filesystem you are on. First try the ls or dir commands to see which returns faster: os.listdir() or that funny little program. You'll see that both are in trouble. The key here is that your directory is being flooded with new files, which creates a kind of bottleneck.
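If you do shell out to ls or dir, the output can be read line by line and the child process stopped once enough lines have arrived. This is only a sketch, assuming Python 3; the helper name first_output_lines and the example directory path are hypothetical.

```python
import subprocess
from itertools import islice

def first_output_lines(command, limit=100):
    """Run `command`, return its first `limit` output lines, then stop it."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        # Read lazily from the pipe instead of waiting for all the output
        return [line.rstrip("\r\n") for line in islice(proc.stdout, limit)]
    finally:
        proc.kill()          # stop the child if it is still producing output
        proc.stdout.close()
        proc.wait()          # reap the process

# Example on Windows (bare names, newest first); the path is made up:
#   first_output_lines(["cmd", "/c", "dir", "/b", "/o-d", r"C:\some\dir"])
```

Keep in mind that when dir is asked to sort, it probably has to read the entire directory before printing its first line, so this may not be faster than os.listdir() for that case.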

Dalen
  • I found this: http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately It seems that Python doesn't offer low-level access to opendir(). I managed to open the directory with os.open() but not to read it. As soon as I can, I'll reimplement the code from the mentioned post for Windows. Meanwhile you can try subprocessing to the dir program. Maybe it helps. – Dalen Jul 15 '15 at 11:44
  • P.S. Proposed code doesn't work in Cygwin. I hoped it will. – Dalen Jul 15 '15 at 12:04