0

I am just looking for an opinion here...

I have a script which loops through a list of files (7.000 files to be exact) and removes them from the list, if the file size is 0kb.

files = [...]
files_0 = []

for f in files:
    if os.stat(f).st_size==0:
        files.remove(f)
        files_0.append(f)

Looping through the list takes some time. Would it be better if i used a dictionary {filename:filesize} and append each entry to the appropriate list?

Thanks for helping!

alani
  • 12,573
  • 2
  • 13
  • 23
Daniel_T
  • 39
  • 1
  • 8
  • Does this answer your question? [How do I check if a list is empty?](https://stackoverflow.com/questions/53513/how-do-i-check-if-a-list-is-empty) – Roshin Raphel Jun 24 '20 at 10:50
  • I think the time needed to `stat` the file will overshadow the list vs. dict differences (even if there are any for such a small number of files). Something like `[f for f in files if os.stat(f).st_size != 0]` will give you the result you're looking for. – Noufal Ibrahim Jun 24 '20 at 10:50
  • If it is one time activity, present in same folder and you know those files are not useful for sure, you can follow your approach. But, if it's a repetitive job, I'd say, you'd better 'traverse' through directory, make sure to exclude useful but zero size files example, `_ _init_ _.py` , you may need to define "exclude_files" list while deleting. – drd Jun 24 '20 at 11:09
  • Do not remove items from a list that you are iterating over. That is not going to go well. – alani Jun 24 '20 at 11:11
  • Clarification for @alaniwi 's comment; it is going to fine if you do it in list comprehensions – Zircoz Jun 24 '20 at 11:36
  • @Zircoz No it isn't. `a=[1,2,3,4]` `[a.remove(x) for x in a]` `print(a)` gives `[2,4]`. Every time you remove an item, you skip over the next one because the iteration is using an index. It makes no difference whether you are using an explicit loop or a list comprehension. – alani Jun 24 '20 at 11:42
  • Okay, I got doubt on why this happens but that should be another question. Everyone reading, scratch my last comment; @alaniwi is right. – Zircoz Jun 24 '20 at 11:45
  • @Zircoz You don't see the index explicitly, but under the hood it is no doubt using an iterator, with the index number of the next item as an internal property. – alani Jun 24 '20 at 11:50
  • @alaniwi I thought it uses the first version of list instead of changing it in every iteration. – Zircoz Jun 24 '20 at 18:43
  • Sorry for answering so late but "all work and no play makes jack a dull boy".. Thank you very much. List comprehension seems to be a more performant choice. – Daniel_T Jun 25 '20 at 10:45

4 Answers4

1

Probably better to use a set comprehension because .append() and .remove() are not optimal. Then get the difference from the original list.

files = [...]
files_0 = {f for f in files if os.stat(f).st_size==0}
files = set(files) - files_0

Not tested though.

EDIT: if you need to retain the arrangement of items. Use a list comprehension because as stated above, .append() and .remove are not optimal. Creating a new list is faster.

Ronie Martinez
  • 1,254
  • 1
  • 10
  • 14
0

The remove operation can be pretty slow, as shown here. To be more specific, it has a time complexity of O(N). If you're not familiar with big O notation, that means in the worst case scenario you'll have to parse the entire list just to remove this single element.

You'll achieve a better performance by inverting your logic: instead of removing elements that shouldn't belong in the list, create another list and put only the valid elements there. E.g:

files = [ ... ]
valid_files = []
files_0 = []

for f in files:
    if os.stat(f).st_size == 0:
        files_0.append(f)
    else:
        valid_files.append(f)
Allan Juan
  • 2,048
  • 1
  • 18
  • 42
0

try this ...

import os

files_0 = []
path = "<valid directory path>"

for rootdir,dirs,files in os.walk(path):  
  for f in files:  
    filename = os.path.join(root,f)   
      if os.stat(filename).st_size == 0:  
        print(" Removed file ",filename)  
        os.remove(filename)
        files_0.append(filename)
Abhishek D K
  • 2,257
  • 20
  • 28
0

The most efficient way is certainly to use sets for everything, as in Ronie Martinez's answer.

I'll take the liberty of quoting Ronie's code here, for ease of reference (although with a slight edit to use os.path.getsize):

files = [...]
files_0 = {f for f in files if os.path.getsize(f) == 0}
files = set(files) - files_0

As an addendum, here is what you could do if you actually care about the ordering for some reason:

files = [...]
files_0 = [f for f in files if os.path.getsize(f) == 0]

Followed by either:

files = [f for f in files if file not in files_0]  #

or better (because sets are much quicker than lists for testing inclusion):

files_0_set = set(files_0)
files = [f for f in files if f not in files_0_set]

In this case, although we don't specifically want to output the files_0_set, it is a useful intermediate.


By the way, the original code attempts to remove items from a list while iterating over it. This will not work reliably, and may lead to items being missed. It is fine to use an explicit loop if desired, instead of a set comprehension or list comprehension, but the removal of items must be deferred until after the loop.

Additionally, Ronie is correct that removing items from the middle of a list is slow, and therefore it is likely to be faster simply to construct a new list which excludes certain items. However, one situation where you might not want to do that, and therefore should use remove would be if files is a view of a list that you are also using via other references, so you want that list to be updated, not just to reassign the files variable. Here is an example of what you might do in that case. In this example, I will use an explicit loop when constructing files_0 for sake of clarity, although the list comprehension above could also still be used.

files_0 = []
for f in files:
    if os.path.getsize(f) == 0:
        files_0.append(f)

for f in files_0:
    files.remove(f)

The point is that you are doing the removal in a separate loop after the end of the loop over files, and this separate loop is over a different collection (in this case files_0) so you are not removing items from the list that you are iterating over.

alani
  • 12,573
  • 2
  • 13
  • 23