Avoiding previous files

Question

I have a directory with a lot of subdirectories.

I am walking through these directories, finding some files, and running some commands on them. How can I record where I finished? Sometimes the process gets interrupted, and the next time I run the program I want to resume where I left off.

import fnmatch
import os

def locate(pattern, root=os.curdir):
    '''Locate all files matching supplied filename pattern in and below
    supplied root directory.'''

    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)



for filename in locate("*.dll"):
    pass  # do something

A similar question about adding state to a generator can be found here: http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator - beer_monk, Feb 16 '11 at 20:59

score 1 · Accepted Answer · answered Feb 16 '11 at 21:03

There are a couple of ways you could do it, but probably the simplest is to create a new marker file alongside each file that has already been processed, then check for it. For example:

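for filename in locate("*.dll"):
    if os.path.exists(filename + ".processed"):
        continue  # already handled in an earlier run
    process(filename)
    open(filename + ".processed", "w").close()  # leave an empty marker file

# the run completed, so clean up the markers
for filename in locate("*.processed"):
    os.remove(filename)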

senderle · Answer 2 · 2011-02-17T04:39:40.397

I dislike clutter, and I imagine there may be some time between termination and resumption of the script. Hence my preferred approach would be to create a file in the root directory with a list of files that have been processed:

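import os
import sys

rootdir = os.curdir if len(sys.argv) < 2 else sys.argv[1] # or something
logfilename = os.path.join(rootdir, 'processed')
if os.path.exists(logfilename):
    with open(logfilename, 'r') as logfile:
        # one path per line; splitlines() keeps paths containing spaces intact
        processed = set(logfile.read().splitlines())
else:
    processed = set()

filegen = (f for f in locate("*.pdf", rootdir) if f not in processed)
with open(logfilename, 'a') as logfile:
    for filename in filegen:
        do_something(filename)
        logfile.write(filename + '\n')
        logfile.flush()  # push the entry out of Python's buffer in case the script is killed

# every file was processed, so the log is no longer needed
os.remove(logfilename)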

Of course, this only works if you run the script on the same directory after a failure; if that's a problem, David Wolever's solution is an option, or you could set a fixed location for the logfile. Another interesting approach would be to leave a "breadcrumb" in each directory that has been traversed. You'd probably end up reprocessing a few files, but that would be no great loss.
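The breadcrumb idea could look roughly like the sketch below. It is only one possible reading of that suggestion: the marker name `.done` and the helpers `process_tree` and `clear_breadcrumbs` are made up for illustration, and `handle` stands in for whatever you run on each file. A breadcrumb is written only after every matching file directly inside a directory has been handled, so an interrupted directory is redone in full on the next run.

```python
import fnmatch
import os

BREADCRUMB = ".done"  # hypothetical marker filename, not from the answers above


def process_tree(root, pattern, handle):
    """Run handle() on matching files, skipping directories already marked done."""
    for path, dirs, files in os.walk(os.path.abspath(root)):
        if BREADCRUMB in files:
            # Files directly in this directory were finished in an earlier run.
            # os.walk still descends into the subdirectories.
            continue
        for filename in fnmatch.filter(files, pattern):
            handle(os.path.join(path, filename))
        # Drop the breadcrumb only after the whole directory succeeded.
        open(os.path.join(path, BREADCRUMB), "w").close()


def clear_breadcrumbs(root):
    """Remove all breadcrumbs once a full run has completed."""
    for path, dirs, files in os.walk(os.path.abspath(root)):
        if BREADCRUMB in files:
            os.remove(os.path.join(path, BREADCRUMB))
```

You would call `process_tree(rootdir, "*.dll", do_something)` and, if it returns without raising, `clear_breadcrumbs(rootdir)`. Compared with per-file markers this leaves at most one extra file per directory, at the cost of reprocessing the files in whichever directory was interrupted.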
