Here is your answer on how to traverse a large directory file by file!
I searched like a maniac for a Windows DLL that would let me do what is done on Linux, but had no luck.
So, I concluded that the only way is to create my own DLL that will expose those static functions to me, but then I remembered pywintypes.
And, YEEY! this is already done there. And, even more, an iterator function is already implemented! Cool!
A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So, I used pywintypes.
EDIT:
I discovered (much later) that these functions are available from kernel32.dll. They were right under my nose the whole time.
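For reference, calling those kernel32.dll functions directly through ctypes could look roughly like the sketch below. It is written for Python 3, it is not the code used in this answer, and it only works on Windows. The name find_files is made up for this illustration.

```python
import ctypes
from ctypes import wintypes


def find_files(pattern):
    """Yield names matching a Windows wildcard pattern (directory path plus mask)."""
    # WinDLL exists only on Windows, so it is looked up lazily inside the generator.
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    find_first = kernel32.FindFirstFileW
    find_first.argtypes = [wintypes.LPCWSTR, ctypes.POINTER(wintypes.WIN32_FIND_DATAW)]
    find_first.restype = wintypes.HANDLE

    find_next = kernel32.FindNextFileW
    find_next.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.WIN32_FIND_DATAW)]
    find_next.restype = wintypes.BOOL

    find_close = kernel32.FindClose
    find_close.argtypes = [wintypes.HANDLE]
    find_close.restype = wintypes.BOOL

    invalid_handle = wintypes.HANDLE(-1).value  # INVALID_HANDLE_VALUE
    data = wintypes.WIN32_FIND_DATAW()

    handle = find_first(pattern, ctypes.byref(data))
    if handle == invalid_handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        while True:
            if data.cFileName not in (".", ".."):
                yield data.cFileName
            if not find_next(handle, ctypes.byref(data)):
                break  # no more entries (or an error); stop either way
    finally:
        find_close(handle)
```

Usage would be something like `for name in find_files("mydir\\*"): ...`, with the same wildcard rules as the pywintypes version.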
Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, together with anything it depends on, and distribute it with your program without shipping the whole package if you have to.
As you will see from the speed tests, returning a generator is very fast.
After this, you will be able to go file by file and do whatever you want.
NOTE: win32file.FindFilesIterator() returns whole stat of the file/dir, therefore, using my listdir() to get the name and afterwards os.path.get*time() or os.path.is*() doesn't make sense. Better modify my listdir() for those checks.
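On Python 3.6 and later, the same idea of getting the name and the file info in one pass can be sketched with os.scandir(), which the docstring of the code further down recommends anyway. The helper name list_entries and the tuple layout are made up for this illustration.

```python
import os


def list_entries(path):
    """Yield (name, is_dir, mtime) for each entry in path, lazily."""
    with os.scandir(path) as entries:
        for entry in entries:
            # On Windows these calls can usually reuse data from the directory
            # scan; on other systems stat() may cost an extra system call.
            info = entry.stat()
            yield entry.name, entry.is_dir(), info.st_mtime
```

This keeps the "one item at a time" behaviour while avoiding separate os.path.getmtime() or os.path.isdir() calls for every name.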
Now, getting a full solution for your problem is still problematic.
The bad news is that iteration starts at whichever item the directory hands out first, and you cannot choose which one that will be. In my tests on Windows it always returned the directory in sorted order.
The half-good news is that on Windows you can use wildcards to control which files you list. So, to use this on a directory that is constantly filling up, you can mark incoming files with a version and do something like:
from time import sleep

bunch = 1
while True:
    for file in listdir("mydir\\*bunch%i*" % bunch): print file
    sleep(5); bunch += 1
But you'll have to design this very cleverly, or you will miss files that arrived after their bunch was already processed.
I don't know whether FindFilesIterator() will continue detecting new files when they come if you introduce delay between loop turns.
If it did, this may also be your solution.
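The versioned-wildcard idea can also be tried without Windows by filtering names with fnmatch. The sketch below is Python 3; names_for_bunch and the "bunchN_" naming scheme are assumptions for illustration, and os.scandir() stands in for the listdir() from this answer.

```python
import fnmatch
import os


def names_for_bunch(path, bunch):
    """Yield names in path that carry the marker for the given bunch number."""
    # The trailing "_" keeps bunch 1 from also matching bunch 10, 11, ...
    pattern = "*bunch%d_*" % bunch
    with os.scandir(path) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name
```

Note that a plain "*bunch1*" mask would also pick up files marked bunch10 and so on, so it is worth putting a delimiter after the number or zero-padding it.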
You can always make an iterator in advance and then call the next() method to get the next file:
i = listdir(".")
while True:
    try: name = i.next()
    except StopIteration: sleep(1)
    # This probably won't work as imagined though
You can decide how long to wait for new files based on the size of the most recently arrived files. That is a wild guess which assumes that all incoming files will be roughly the same size, plus or minus something.
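Since a generator that has raised StopIteration stays exhausted, one workaround is to rescan the directory on every pass and remember which names were already handed out. The following is a minimal Python 3 sketch of that idea; watch_new_files, interval and max_polls are names invented here, and os.scandir() again stands in for listdir().

```python
import os
import time


def watch_new_files(path, interval=1.0, max_polls=None):
    """Yield each name in path once, rescanning every `interval` seconds."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.name not in seen:
                    seen.add(entry.name)
                    yield entry.name
        polls += 1
        # Sleep only if another scan is going to follow.
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

The set of seen names grows with the directory, so for a really huge directory you would want to move or delete processed files instead of remembering them.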
However, win32file offers some functions that can help you monitor the directory for changes, and I think that this is your best bet.
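A rough sketch of such monitoring, based on the commonly used ReadDirectoryChangesW recipe, is shown below. It is Python 3, Windows-only, needs pywin32 installed, and I am assuming the call signatures as they appear in that recipe, so check them against the pywin32 documentation. The function name wait_for_new_files is made up.

```python
def wait_for_new_files(path):
    """Yield names of files created in path, blocking until each one appears."""
    # pywin32 is Windows-only, so the imports live inside the generator.
    import win32con
    import win32file

    FILE_LIST_DIRECTORY = 0x0001  # access right needed to watch a directory
    FILE_ACTION_ADDED = 1         # action code reported for a newly created file

    handle = win32file.CreateFile(
        path,
        FILE_LIST_DIRECTORY,
        win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE | win32con.FILE_SHARE_DELETE,
        None,
        win32con.OPEN_EXISTING,
        win32con.FILE_FLAG_BACKUP_SEMANTICS,  # required to open a directory
        None,
    )
    try:
        while True:
            # Blocks until something changes, then returns (action, name) pairs.
            changes = win32file.ReadDirectoryChangesW(
                handle, 8192, False, win32con.FILE_NOTIFY_CHANGE_FILE_NAME, None, None
            )
            for action, name in changes:
                if action == FILE_ACTION_ADDED:
                    yield name
    finally:
        handle.Close()
```

Keep in mind that the notification fires when a file is created, not when the writer has finished with it, so the "file arrived but is not complete yet" problem still has to be handled separately.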
The speed tests also show that building a list from this iterator is slower than calling os.listdir(). However, os.listdir() blocks until the whole directory has been read, while my listdir() returns immediately and reads entries as you iterate.
Its purpose is not to create lists of files anyway. I don't know why this slowdown appears; I can only guess that it has to do with DLL calls, list construction, sorting or something like that. os.listdir() is written entirely in C.
You can see some usage examples in the if __name__=="__main__" block. Save the code as listdir.py and import it with 'from listdir import *'.
Here is the code:
#! /usr/bin/env python
"""
An equivalent of os.listdir() but as a generator using ctypes on
Unixoides and pywintypes on Windows.
On Linux there is shared object libc.so that contains file manipulation
functions we need: opendir(), readdir() and closedir().
On Windows those manipulation functions are provided
by static library header windows.h. As pywintypes is a wrapper around
this API we will use it.
kernel32.dll contains FindFirstFile(), FindNextFile() and FindClose() as well and they can be used directly via ctypes.
The Unix version of this code is an adaptation of code provided by user
'jason-orendorff' on Stack Overflow answering a question by user 'adrien'.
The original URL is:
http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately
The Unix code is tested on Raspbian for now and it works. A reasonable
conclusion is that it'll work on all Debian based distros as well.
NOTE: dirent structure is not the same on all distros, so the code will break on some of them.
The code is also tested on Cygwin using cygwin1.dll and it
doesn't work.
If platform isn't Windows or Posix environment, listdir will be
redirected back to os.listdir().
NOTE: There is a scandir module that implements this with no dependencies, excellent error handling and portability. I found it only after putting this code together. scandir() is included in the standard library since Python 3.5 as os.scandir().
You definitely should use scandir, not this code.
The scandir module is available on pypi.python.org.
"""
import sys, os
__all__ = ["listdir"]
if sys.platform.startswith("win"):
    from win32file import FindFilesIterator
    def listdir (path):
        """
        A generator to return the names of files in the directory passed in
        """
        if "*" not in path and "?" not in path:
            st = os.stat(path) # Raise an error if dir doesn't exist or access is denied to us
            # Check if we got a dir or something else!
            # Check gotten from stat.py (for fast checking):
            if (st.st_mode & 0170000) != 0040000:
                e = OSError()
                e.errno = 20; e.filename = path; e.strerror = "Not a directory"
                raise e
            path = path.rstrip("\\/")+"\\*"
        # Else: Decide that user knows what she/he is doing
        for file in FindFilesIterator(path):
            name = file[-2]
            # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
            if name=="." or name=="..": continue
            yield name
elif os.name=="posix":
    if not sys.platform.startswith("linux"):
        print >> sys.stderr, "WARNING: Environment is Unix but platform is '"+sys.platform+"'\nlistdir() may not work properly."
    from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
    from ctypes.util import find_library
    class c_dir(Structure):
        """Opaque type for directory entries, corresponds to struct DIR"""
        pass
    c_dir_p = POINTER(c_dir)
    class c_dirent(Structure):
        """Directory entry"""
        # FIXME not sure these are the exactly correct types!
        _fields_ = (
            ('d_ino', c_long), # inode number
            ('d_off', c_long), # offset to the next dirent
            ('d_reclen', c_ushort), # length of this record
            ('d_type', c_byte), # type of file; not supported by all file system types
            ('d_name', c_char * 4096) # filename
            )
    c_dirent_p = POINTER(c_dirent)
    c_lib = CDLL(find_library("c"))
    # Extract functions:
    opendir = c_lib.opendir
    opendir.argtypes = [c_char_p]
    opendir.restype = c_dir_p
    readdir = c_lib.readdir
    readdir.argtypes = [c_dir_p]
    readdir.restype = c_dirent_p
    closedir = c_lib.closedir
    closedir.argtypes = [c_dir_p]
    closedir.restype = c_int
    def listdir(path):
        """
        A generator to return the names of files in the directory passed in
        """
        st = os.stat(path) # Raise an error if path doesn't exist or we don't have permission to access it
        # Check if we got a dir or something else!
        # Check gotten from stat.py (for fast checking):
        if (st.st_mode & 0170000) != 0040000:
            e = OSError()
            e.errno = 20; e.filename = path; e.strerror = "Not a directory"
            raise e
        dir_p = opendir(path)
        try:
            while True:
                p = readdir(dir_p)
                if not p: break # End of directory
                name = p.contents.d_name
                if name!="." and name!="..": yield name
        finally: closedir(dir_p)
else:
    print >> sys.stderr, "WARNING: Platform is '"+sys.platform+"'!\nFalling back to os.listdir(), iterator generator will not be returned!"
    listdir = os.listdir
if __name__ == "__main__":
    print
    if len(sys.argv)!=1:
        try: limit = int(sys.argv[2])
        except: limit = -1
        count = 0
        for name in listdir(sys.argv[1]):
            if count==limit: break
            count += 1
            print repr(name),
        print "\nListed", count, "items from directory '%s'" % sys.argv[1]
    if len(sys.argv)!=1: sys.exit()
    from timeit import *
    print "Speed test:"
    dir = ("/etc", r"C:\WINDOWS\system32")[sys.platform.startswith("win")]
    t = Timer("l = listdir(%s)" % repr(dir), "from listdir import listdir")
    print "Measuring time required to create an iterator to list a directory:"
    time = t.timeit(200)
    print "Time required to return a generator for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = os.listdir(%s)" % repr(dir), "import os")
    print "Measuring time required to create a list of directory in advance using os.listdir():"
    time = t.timeit(200)
    print "Time required to return a list for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = []\nfor file in listdir(%s): l.append(file)" % repr(dir), "from listdir import listdir")
    print "Measuring time needed to create a list of directory using our listdir() instead of os.listdir():"
    time = t.timeit(200)
    print "Time required to create a list for directory '"+dir+"' using our listdir() instead of os.listdir() is", time, "seconds measured through 200 passes"