
I have a function which I want to enumerate through all files and folders from a target folder. When/if it finds rar files I want it to extract them and then delete them. In the case of multi-part archives it will also check for and delete the remaining files (which have already been extracted with the first volume).

I was using os.listdir in a for loop, but this approach has two problems: a) it won't handle subfolders without writing a recursion loop for them (which I don't want to do because recursion hurts my head), and b) because the for loop builds its list of items only at the beginning, when it reaches a file name that has already been removed in a prior iteration I get a failure to find the file.

It appears os.walk may be better for (a) above, and my research so far suggests I should be able to update the lists os.walk yields in real time on each iteration. However, I can't figure out how to do this.

I've got something like this:

for root, dirs, files in os.walk('d:\\test'):
    for file in files:
        print 'files (before remove): ', file, files
        # This is where I would do some operation that deletes one or more files.
        files.remove(file)
        print 'files (after remove): ', file, files

However the output is like this:

D:\test>d:\Python27\python.exe d:\file.py
files (before remove):  Crystal.part01.rar ['Crystal.part01.rar', 'Crystal.part02.rar', 'Crystal.part03.rar', 'Crystal.part04.rar', 'Crystal.part05.rar', 'Crystal.part06.rar']
files (after remove):  Crystal.part01.rar ['Crystal.part02.rar', 'Crystal.part03.rar', 'Crystal.part04.rar', 'Crystal.part05.rar', 'Crystal.part06.rar']
files (before remove):  Crystal.part03.rar ['Crystal.part02.rar', 'Crystal.part03.rar', 'Crystal.part04.rar', 'Crystal.part05.rar', 'Crystal.part06.rar']
files (after remove):  Crystal.part03.rar ['Crystal.part02.rar', 'Crystal.part04.rar', 'Crystal.part05.rar', 'Crystal.part06.rar']
files (before remove):  Crystal.part05.rar ['Crystal.part02.rar', 'Crystal.part04.rar', 'Crystal.part05.rar', 'Crystal.part06.rar']
files (after remove):  Crystal.part05.rar ['Crystal.part02.rar', 'Crystal.part04.rar', 'Crystal.part06.rar']

I think this makes sense: we can see the list being updated, but because the inner for statement has already started iterating over the files, it keeps stepping through the original positions, which are now offset by one, creating a "skip" effect.

How can I operate on each file in the directory tree while letting the calling loop know to skip any items that have already been removed?

Update - I may be incorrect in assuming this can be done. What gave me this idea was this snippet from the Python docs:

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False has no effect on the behavior of the walk, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.

On reading it again I see it only mentions dirnames and not filenames - so while I still don't understand the exact method to accomplish this, it looks like you may only be able to manipulate the dirnames in place.

BenH
  • Don't remove anything from `files`. – wwii Sep 11 '17 at 18:05
  • that doesn't do anything except make it a standard os.walk loop through the files. If the processing of Crystal.part01.rar (the first in the loop) was to also include the deletion of parts 2-5 then the script would terminate with a `The system cannot find the file specified:` error. I need to let the calling loop/walk/whatever know that certain files have been removed and it shouldn't try to process them. – BenH Sep 11 '17 at 19:08
  • To add on since you may be talking over my head by being terse - I know that I can simply allow the script to error here and catch the exception - under the assumption that we should expect files to be missing if we just deleted them. However I'm not sure that is really best practice(?) since we would expect/continue on any IOError which may be too broad... – BenH Sep 11 '17 at 19:11
  • If you want to use the result to modify the underlying filesystem, you can loop through the os calls, store them in a list, then perform your operations on the list, after that's done. You can always call os.walk again, once you are done modifying things. – Kenny Ostrom Sep 11 '17 at 20:40
  • I updated the OP to correct/specify what I was referring to. Would still like to see a workable code example that meets my use case. Thanks. – BenH Sep 11 '17 at 20:56
  • The docs you posted say you can delete or reorder the lists (meaning they are stored in memory for you), but it doesn't say it will discover new folders which appear on the actual filesystem. In fact, according to the design implied by the docs here it must not. It should assume it found them the first time, but you deleted them from the list. If you want it to find new folders, you need to manually add them to that list (if that works), or requery the filesystem. (ps: you could have linked the docs) – Kenny Ostrom Sep 11 '17 at 20:59
  • Python docs: https://docs.python.org/2/library/os.html It's the `del` and "slice assignments" that I don't totally comprehend, but it says the list can be modified "in place" - I just don't know what effect that actually has once the walk loop has begun. For my case it won't matter since I'm looking at filenames, not dirnames, but I would still like to understand the usage. – BenH Sep 11 '17 at 23:34
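For reference, the in-place modification the docs describe looks like the sketch below (the hidden-directory filter is just an illustrative pruning rule). The key is the slice assignment `dirs[:] = ...`, which replaces the contents of the very list os.walk handed you, so the walk sees the change; a plain `dirs = ...` would only rebind the local name and have no effect.

```python
import os

# Sketch: pruning os.walk in-place via slice assignment.
for root, dirs, files in os.walk('.'):
    # Replace the list's *contents*; os.walk will not recurse into
    # any directory removed here. (dirs = [...] would NOT work.)
    dirs[:] = [d for d in dirs if not d.startswith('.')]
    for fname in files:
        pass  # process os.path.join(root, fname)
```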

1 Answer

for root, dirs, files in os.walk('d:\\test'):
    for file in files:
        #process stuff

files is a list that you are iterating over - you should not modify it, as you have discovered. If your processing deletes a file that hasn't yet been reached in the loop, you can do three things (that I can think of):

  1. Check to see if the file is there before you process it

    if not os.path.exists(os.path.join(root, fname)):
        continue  # a previous iteration already deleted this file
    
  2. Use a try/except to catch the IOError. If you want to limit the exception handling further, check the error's errno in the except suite and re-raise the exception if it is something other than a missing file.

    import errno

    fname = 'foo.bar'
    try:
        with open(fname) as f:
            pass  # process the file here
    except IOError as e:
        if e.errno != errno.ENOENT:
            raise  # re-raise anything other than "No such file or directory"
    
  3. Keep a separate list/set containing all the files your process has deleted, and check membership in it before trying to process a file.
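A sketch of option 3, assuming the multi-part handling deletes sibling volumes while the walk is in progress (the target path and the extraction step are placeholders):

```python
import os

deleted = set()  # full paths this run has already removed

for root, dirs, files in os.walk('d:\\test'):
    for fname in files:
        path = os.path.join(root, fname)
        if path in deleted:
            continue  # a sibling volume we already extracted and removed
        # ... extract the archive here; whenever you delete another
        # volume of the same set, record it:
        #     os.remove(other_path)
        #     deleted.add(other_path)
```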


If you really needed to, you can write a class with the behavior you need.

#Python 2.7 code
import collections

class F(collections.deque):
    # Consumes itself as it iterates, so removing a not-yet-visited
    # item simply means it is never yielded.
    def __iter__(self):
        return self
    def next(self):
        try:
            return self.pop()  # pops from the right-hand end
        except IndexError:
            raise StopIteration

a = [1,2,3,4]
f = F(a)
for n in f:
    print n
    if n == 3:
        f.remove(2)

Result

>>> 
4
3
1
>>>
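A Python 3 version of the same class, for reference - the only change is that the iterator protocol method is spelled `__next__` rather than `next`:

```python
import collections

class F(collections.deque):
    """Deque that consumes itself as it is iterated, so removing a
    pending item means it is simply never yielded."""
    def __iter__(self):
        return self
    def __next__(self):  # Python 3 spelling of the protocol method
        try:
            return self.pop()
        except IndexError:
            raise StopIteration

f = F([1, 2, 3, 4])
seen = []
for n in f:
    seen.append(n)
    if n == 3:
        f.remove(2)  # 2 hasn't been yielded yet, so it is skipped

# seen is [4, 3, 1], matching the Python 2 output above
```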
wwii
  • Thanks. I have come to those two conclusions as well. For now I have restructured my functions so that the process function is called on one file at a time. It is possible it is less efficient than the other 2 options. I'm interested also in @KennyOstrom's list possibility. However is there any construct which would allow for dynamic removal from a for-type iterative loop? Am I wrong in my interpretation of the os.walk documentation that states this can be done for the dirnames? – BenH Sep 11 '17 at 23:24