
os.walk has a helpful example:

import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

Despite the note that os.walk got faster in Python 3.5 by switching to os.scandir, the documentation doesn't mention that it's still a sub-optimal implementation on Windows.

https://www.python.org/dev/peps/pep-0471/ does describe this and gets it almost right, but it recommends recursion. With arbitrary folder structures that doesn't work so well: you'll quickly hit Python's recursion limit, so you'd only be able to iterate a folder structure up to roughly 1000 folders deep (which isn't necessarily unrealistic if you're starting at the root of the filesystem). The real limit isn't actually 1000, either; it's 1000 minus your Python call depth at the point where you run this function. If you're doing this in response to a web service request through Django with lots of business-logic layers, it wouldn't be unrealistic to get close to that limit.

Vitali
    You're seriously troubled about recursion limit here? How deeply are your directories nested? – superb rain Aug 19 '20 at 02:13
  • Why not just use `du -s` on linux or `dir /S` on windows and parse the output? – Iain Shelvington Aug 19 '20 at 02:17
  • Out of curiosity, I checked my entire C: drive. Deepest level is 19. – superb rain Aug 19 '20 at 02:35
    @IainShelvington Native python code is far more cross-platform than invoking arbitrary scripts. `dir /S` can be a security concern since you'd need to also pass `shell=True` which means the directory you are trying to iterate can become a source of shell injection attacks. Even in certain Linux environments you may not have a `du` script to shell out to. Native cross-platform code is generally preferable to shelling out to random utilities that aren't really guaranteed to be available. – Vitali Aug 21 '20 at 17:05
  • @superbrain yes I am. There are two reasons. The first is that "my C: drive has the deepest level of 19" isn't a good barometer for determining anything. I've easily exceeded the maximum path length on automated build systems more than once in my career. Realistically you probably won't have 1000 directories but also you don't have guarantees on where in the call stack this would get called from. So I don't think it's so extraordinary for the sum of it to be a potential concern. – Vitali Aug 21 '20 at 17:14
  • On fast filesystems (i.e. RAM) the non-recursive variant should be significantly faster as well & use significantly less memory due to not having recursion. Tail recursion optimization in the Python compiler/interpreter would also fix the need for rewriting it as non-recursive. – Vitali Aug 21 '20 at 17:15

1 Answer


The following snippet should be optimal on all operating systems and handle any folder structure you throw at it. Memory usage will obviously grow with the number of folders you encounter, but to my knowledge there's nothing you can really do about that, as you have to keep track of where you still need to go somehow.

import os

def get_tree_size(path):
    """Return the total size in bytes of all files under path, iteratively."""
    total_size = 0
    dirs = [path]  # explicit stack of directories still to visit
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)  # visit later; no recursion needed
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size

It's possible that using a collections.deque would speed up operations versus the naive use of a list here, but I suspect it would be hard to write a benchmark showing it, with disk speeds what they are today.

Vitali
    Why would a deque perform better? – juanpa.arrivillaga Aug 19 '20 at 02:25
  • Deques are double-ended queues. They have the spatial-locality advantages of lists plus the ability to pop off the front without shifting all the subsequent elements (which a Python list must do, since popping an arbitrary element is O(n) as far as I'm aware: https://wiki.python.org/moin/TimeComplexity). Another option would be to pop the last element instead of the first, which changes the iteration order from FIFO to LIFO – Vitali Aug 21 '20 at 17:02
    But you *are* popping off the end, `next_dir = dirs.pop()`, which is constant time. – juanpa.arrivillaga Aug 24 '20 at 03:03
  • `pop` does seem to be mildly faster: https://stackoverflow.com/questions/23487307/python-deque-vs-list-performance-comparison Again, in practice I doubt it'll ever matter. – Vitali Mar 31 '22 at 18:08