4

Let's say my structure is like this:

/-- am here
/one/some/dir
/two
/three/has/many/leaves
/hello/world

and say /one/some/dir contains a big file (500 MB), and /three/has/many/leaves contains a 400 MB file in each folder.

I want to generate the size for each directory, to have this output:

/ - in total for all
/one/some/dir 500mb
/two 0
/three/has/many/leaves - 400mb
/three/has/many 800
/three/has/ 800+someotherbigfilehere

How would I go about this?

rapadura
  • I'm trying to understand your question. How does the output you're looking for differ from the output of `du -h .`? – mgilson Sep 18 '12 at 15:44
  • I want the output of `du .`, yes. In Python. Without using subprocesses or executing du. – rapadura Sep 18 '12 at 15:45
  • Seriously.. GOOGLE. http://stackoverflow.com/questions/120656/directory-listing-in-python http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python – desimusxvii Sep 18 '12 at 15:49
  • @desimusxvii I've read all I can find and still can't figure it out. I can get the total of _a_ folder, but I want the total of each folder in the folder. None of those examples do what I want, and it isn't clear if it's even possible with os.walk or if I need to do it with os.listdir – rapadura Sep 18 '12 at 15:56
  • 2
    I just linked you a way to traverse the files and a way to get the size of the file. All you have to do is add them up! You might have to actually code a little bit on your own. Sorry. – desimusxvii Sep 18 '12 at 15:58
  • @desimusxvii yes, I can add them up too, then I get a total under the path that is walked. That's not what I want. I want the total for each folder that is walked. – rapadura Sep 18 '12 at 15:58
  • 4
    Your comments here and to answers show a reluctance if not a refusal to do any coding of your own. If you're struggling to get a particular behavior taking into account all the sample code linked and/or provided, show us what you've *actually* used and we can help point in the right direction...if your attitude is to repeatedly suggest that "this problem is unique in the world, so write it for me", you're unlikely to get that. – hexparrot Sep 18 '12 at 16:13
  • I've modified all the examples in every way I could think of without going insane, and no, the only answer in this question is still the first thing I read: help(os.walk). @hexparrot but oh yeah, "it's easy enough". – rapadura Sep 18 '12 at 16:37
  • 1
    If you've made such modifications, why haven't you shown them and explained where you got stuck? Either way, blindly and randomly modifying code until it does what you want is not the way to program. – David Robinson Sep 18 '12 at 16:48

6 Answers

11

Have a look at os.walk. Specifically, the documentation has an example to find the size of a directory:

import os
from os.path import join, getsize

for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

This should be easy enough to modify for your purposes.


Here's an untested version in response to your comment:

import os
from os.path import join, getsize

dirs_dict = {}

# We need to walk the tree from the bottom up so that a directory can have
# easy access to the sizes of its subdirectories.
for root, dirs, files in os.walk('python/Lib/email', topdown=False):

    # Loop through every non-directory file in this directory and sum their sizes
    size = sum(getsize(join(root, name)) for name in files)

    # Look at all of the subdirectories and add up their sizes from `dirs_dict`
    subdir_size = sum(dirs_dict[join(root, d)] for d in dirs)

    # Store the size of this directory (plus subdirectories) in a dict so we
    # can access it later
    my_size = dirs_dict[root] = size + subdir_size

    print('%s: %d' % (root, my_size))
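The bottom-up idea above can be checked against a tiny throwaway tree with known file sizes (the layout and byte counts here are invented purely for the demonstration):

```python
# A compact, testable restatement of the bottom-up approach: walk with
# topdown=False so every subdirectory's total is already in the dict
# by the time its parent is visited.
import os
import tempfile
from os.path import join, getsize

def dir_sizes(top):
    """Return {directory: cumulative size of its files and subdirectories}."""
    sizes = {}
    for root, dirs, files in os.walk(top, topdown=False):
        file_total = sum(getsize(join(root, name)) for name in files)
        subdir_total = sum(sizes[join(root, d)] for d in dirs)
        sizes[root] = file_total + subdir_total
    return sizes

# Build a throwaway tree: top/a/file1 (100 bytes), top/file2 (50 bytes)
top = tempfile.mkdtemp()
os.mkdir(join(top, 'a'))
with open(join(top, 'a', 'file1'), 'wb') as f:
    f.write(b'x' * 100)
with open(join(top, 'file2'), 'wb') as f:
    f.write(b'y' * 50)

sizes = dir_sizes(top)
print(sizes[join(top, 'a')])  # 100
print(sizes[top])             # 150
```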
mgilson
  • It doesn't seem easy enough for me. I wouldn't be asking if I hadn't read the documentation and searched everything I could find. – rapadura Sep 18 '12 at 15:55
  • @Antonioo -- Doesn't this do what you want if you remove the `if 'CVS' in dirs` bit? – mgilson Sep 18 '12 at 16:00
  • no it doesn't, it gives different output than `du .`, wrong sizes. It gives the total file sizes in /one but I want total file sizes + all the subfolder sizes of /one. – rapadura Sep 18 '12 at 16:04
  • If the example code finds all the subfolders in your hierarchy and returns the proper sizes for those subfolders, then how hard can it be for you to add those sizes to the containing folder? As @hexparrot says, you'll need to put a little thought into how you would use the suggested tool to give the results you want. If you can take the output of the example code and derive the results you want using a pencil and paper, then try writing some code that duplicates what you're doing using the pencil. – Dave Sep 18 '12 at 16:53
  • 1
    @Dave, how hard can it be, well not everybody is a genius so, yeah its quite hard. the documentation for os.walk is also hard to read, its especially frustrating when everyone tells me this is simple, and I see everywhere on the internet people just spew the same code without actually explaining what it does and why it works. And the above in this answer doesnt work either and no explanation is given, but now anyway I am a step closer as I see I need a dict to keep each subfolders mapped to its size. And also du . is giving me different results for an empty dir,while the getsize gives 0. – rapadura Sep 18 '12 at 17:07
  • 1
    @Antonioo -- My apologies. There was a slight logic error in my edit (it didn't take *all* subdirectories into account -- It only took 1 level). I've updated. Also, I've commented that example pretty heavily to hopefully address the "lack of explanation". – mgilson Sep 18 '12 at 17:14
  • @mgilson thank you, I get it now. I approached it top-down and got stuck heavily, but now I see it: go bottom-up and stick the results in a dict! Thanks! – rapadura Sep 18 '12 at 18:33
  • Note _getsize_ follows symbolic links. It can fail if there is a broken link, and you may not want to count the linked file in your results. _os.lstat(filename).st_size_ returns the size of the symlink itself. – gerardw Dec 19 '22 at 14:14
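The distinction raised in the last comment can be seen with a quick throwaway experiment (POSIX-only, uses a temporary directory): `getsize` follows the link to its target, while `os.lstat` reports the link entry itself.

```python
# getsize follows symlinks; os.lstat does not.
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, 'target')
with open(target, 'wb') as f:
    f.write(b'x' * 1000)

link = os.path.join(d, 'link')
os.symlink(target, link)

followed = os.path.getsize(link)      # size of the target file: 1000
link_itself = os.lstat(link).st_size  # size of the symlink entry (small)
print(followed, link_itself)
```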
1

Actually, @mgilson's answer does not work if there are symbolic links in the directories. To handle those you have to do this instead:

import os
from os.path import join, getsize

dirs_dict = {}
for root, dirs, files in os.walk(directory, topdown=False):
    if os.path.islink(root):
        dirs_dict[root] = 0
    else:
        dir_size = getsize(root)

        # Loop through every non-directory file in this directory and sum their sizes
        for name in files:
            full_name = join(root, name)
            if os.path.islink(full_name):
                nsize = 0
            else:
                nsize = getsize(full_name)
            dirs_dict[full_name] = nsize
            dir_size += nsize

        # Look at all of the subdirectories and add up their sizes from `dirs_dict`
        subdir_size = 0
        for d in dirs:
            full_d = join(root, d)
            if os.path.islink(full_d):
                dirs_dict[full_d] = 0
            else:
                subdir_size += dirs_dict[full_d]

        dirs_dict[root] = dir_size + subdir_size
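The skip-symlinks idea can be verified with a throwaway tree containing a link: files reached through a symlink count as 0, so the link target is not double-counted (the file names and sizes here are invented for the demonstration):

```python
# Symlinked files contribute 0, so a 500-byte file plus a link to it
# totals 500, not 1000.
import os
import tempfile
from os.path import join, getsize, islink

top = tempfile.mkdtemp()
with open(join(top, 'real'), 'wb') as f:
    f.write(b'x' * 500)
os.symlink(join(top, 'real'), join(top, 'alias'))

total = 0
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        full = join(root, name)
        total += 0 if islink(full) else getsize(full)
print(total)  # 500, not 1000
```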
Thomas Leonard
1

The following script prints the directory size of all sub-directories of the specified directory. It should be platform-independent (POSIX/Windows/etc.). It also tries to benefit, where possible, from caching the calls of the recursive function. If the argument is omitted, the script works in the current directory. The output is sorted by directory size from biggest to smallest, so you can adapt it for your needs.

PS: I've used ActiveState recipe 578019 for showing directory sizes in a human-friendly format.

from __future__ import print_function
import os
import sys
import operator

def null_decorator(ob):
    return ob

if sys.version_info >= (3,2,0):
    import functools
    my_cache_decorator = functools.lru_cache(maxsize=4096)
else:
    my_cache_decorator = null_decorator

start_dir = os.path.normpath(os.path.abspath(sys.argv[1])) if len(sys.argv) > 1 else '.'

@my_cache_decorator
def get_dir_size(start_path = '.'):
    total_size = 0
    if 'scandir' in dir(os):
        # using fast 'os.scandir' method (new in version 3.5)
        for entry in os.scandir(start_path):
            if entry.is_dir(follow_symlinks = False):
                total_size += get_dir_size(entry.path)
            elif entry.is_file(follow_symlinks = False):
                total_size += entry.stat().st_size
    else:
        # using slow, but compatible 'os.listdir' method
        for entry in os.listdir(start_path):
            full_path = os.path.abspath(os.path.join(start_path, entry))
            if os.path.islink(full_path):
                continue
            if os.path.isdir(full_path):
                total_size += get_dir_size(full_path)
            elif os.path.isfile(full_path):
                total_size += os.path.getsize(full_path)
    return total_size

def get_dir_size_walk(start_path = '.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size

def bytes2human(n, format='%(value).0f%(symbol)s', symbols='customary'):
    """
    (c) http://code.activestate.com/recipes/578019/

    Convert n bytes into a human readable string based on format.
    symbols can be either "customary", "customary_ext", "iec" or "iec_ext",
    see: https://en.wikipedia.org/wiki/Binary_prefix#Specific_units_of_IEC_60027-2_A.2_and_ISO.2FIEC_80000

      >>> bytes2human(0)
      '0.0 B'
      >>> bytes2human(0.9)
      '0.0 B'
      >>> bytes2human(1)
      '1.0 B'
      >>> bytes2human(1.9)
      '1.0 B'
      >>> bytes2human(1024)
      '1.0 K'
      >>> bytes2human(1048576)
      '1.0 M'
      >>> bytes2human(1099511627776127398123789121)
      '909.5 Y'

      >>> bytes2human(9856, symbols="customary")
      '9.6 K'
      >>> bytes2human(9856, symbols="customary_ext")
      '9.6 kilo'
      >>> bytes2human(9856, symbols="iec")
      '9.6 Ki'
      >>> bytes2human(9856, symbols="iec_ext")
      '9.6 kibi'

      >>> bytes2human(10000, "%(value).1f %(symbol)s/sec")
      '9.8 K/sec'

      >>> # precision can be adjusted by playing with %f operator
      >>> bytes2human(10000, format="%(value).5f %(symbol)s")
      '9.76562 K'
    """
    SYMBOLS = {
        'customary'     : ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y'),
        'customary_ext' : ('byte', 'kilo', 'mega', 'giga', 'tera', 'peta', 'exa',
                           'zetta', 'iotta'),
        'iec'           : ('Bi', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi', 'Yi'),
        'iec_ext'       : ('byte', 'kibi', 'mebi', 'gibi', 'tebi', 'pebi', 'exbi',
                           'zebi', 'yobi'),
    }
    n = int(n)
    if n < 0:
        raise ValueError("n < 0")
    symbols = SYMBOLS[symbols]
    prefix = {}
    for i, s in enumerate(symbols[1:]):
        prefix[s] = 1 << (i+1)*10
    for symbol in reversed(symbols[1:]):
        if n >= prefix[symbol]:
            value = float(n) / prefix[symbol]
            return format % locals()
    return format % dict(symbol=symbols[0], value=n)

############################################################
###
###  main ()
###
############################################################
if __name__ == '__main__':
    dir_tree = {}
    ### version, that uses 'slow' [os.walk method]
    #get_size = get_dir_size_walk
    ### this recursive version can benefit from caching the function calls (functools.lru_cache)
    get_size = get_dir_size

    for root, dirs, files in os.walk(start_dir):
        for d in dirs:
            dir_path = os.path.join(root, d)
            if os.path.isdir(dir_path):
                dir_tree[dir_path] = get_size(dir_path)

    for d, size in sorted(dir_tree.items(), key=operator.itemgetter(1), reverse=True):
        print('%s\t%s' %(bytes2human(size, format='%(value).2f%(symbol)s'), d))

    print('-' * 80)
    if sys.version_info >= (3,2,0):
        print(get_dir_size.cache_info())

Sample output:

37.61M  .\subdir_b
2.18M   .\subdir_a
2.17M   .\subdir_a\subdir_a_2
4.41K   .\subdir_a\subdir_a_1
----------------------------------------------------------
CacheInfo(hits=2, misses=4, maxsize=4096, currsize=4)
MaxU - stand with Ukraine
0

I achieved this with this code:

import os

def get_dir_size(path=None):
    # Note: a default of os.getcwd() in the signature would be evaluated only
    # once, at definition time, so resolve the default inside the function.
    if path is None:
        path = os.getcwd()

    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        dirsize = 0
        for f in filenames:
            fp = os.path.join(dirpath, f)
            size = os.path.getsize(fp)
            dirsize += size
            total_size += size
        print('\t', dirsize, dirpath)
    print(" {0:.2f} KiB".format(total_size / 1024))
0

I achieved this using the pathlib module. The following code will calculate correct directory sizes for every sub-directory in a given directory tree.


Note: If you wish to calculate the total size of the given root directory only, and not of all the individual sub-directories, then you must get rid of the outer loop (i.e. `for sub in subdir:`), replace `ls = list(sub.rglob('*.*'))` with `ls = list(dir_path.rglob('*.*'))`, and correct the indentation accordingly.


So, here's the sample code, tested using Python 3.7.6 on Windows.

import os 
from pathlib import Path

# Set home/root path
dir_path = Path('//?/C:/Downloads/.../.../.../.../...')

# IMP_NOTE: A path may exceed the classic MAX_PATH - 1 (259) character limit
# for DOS paths. Use an extended (verbatim) path prefix such as "\\\\?\\C:\\"
# in order to access the full length that's supported by the filesystem --
# about 32,760 characters. Alternatively, use Windows 10 with Python 3.6+
# and enable long DOS paths in the registry.

# pathlib normalizes Windows paths to use backslash, so we can use
# Path('//?/D:/') without having to worry about escaping backslashes.

# Generate a complete list of sub-directories
subdir = list(x for x in dir_path.rglob('*') if x.is_dir())

for sub in subdir:
    tot_dir_size = 0
    ls = list(sub.rglob('*.*'))
    # print(sub, '\n')
    # print(len(ls), '\n')
    for k in ls:
        tot_dir_size += os.path.getsize(k)
    # print(format(tot_dir_size, ',d'))
    print("For Sub-directory: " + sub.parts[-1] + "   ===>   " + 
          "Size = " + str(format(tot_dir_size, ',d')) + "\n")

# path.parts ==> Provides a tuple giving access to the path’s various components
# (Ref.: pathlib documentation)


Output:
For Sub-directory: DIR_1   ===>   Size = 5,600,621,618

For Sub-directory: DIR_2   ===>   Size = 9,113,492,347

For Sub-directory: DIR_3   ===>   Size = 928,986,489

For Sub-directory: DIR_4   ===>   Size = 2,125,250,470
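One caveat with the `rglob('*.*')` pattern used above: it only matches names that contain a dot, so files without an extension are silently skipped. A throwaway demonstration (file names and sizes invented for the example):

```python
# rglob('*.*') misses extensionless files; rglob('*') with is_file() does not.
import os
import tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp())
(d / 'with.ext').write_bytes(b'x' * 10)
(d / 'noext').write_bytes(b'y' * 20)

dotted = sum(os.path.getsize(p) for p in d.rglob('*.*'))
every = sum(p.stat().st_size for p in d.rglob('*') if p.is_file())
print(dotted, every)  # 10 30
```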

Amar
0

Use recursive_size

pip install recursive-size

Then do (from its own documentation):

from recursive_size import get_size
size = get_size('path/to/folder')
print(size)
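If adding a dependency is undesirable, a roughly equivalent cumulative size can be computed with the standard library alone; a minimal sketch (the helper name and throwaway tree are invented for the example):

```python
# Stdlib-only cumulative directory size, roughly what recursive-size provides.
import tempfile
from pathlib import Path

def get_size_stdlib(folder):
    """Sum the sizes of all regular files under folder, recursively."""
    return sum(p.stat().st_size for p in Path(folder).rglob('*') if p.is_file())

# Throwaway demonstration tree: d/a/f (7 bytes)
d = Path(tempfile.mkdtemp())
(d / 'a').mkdir()
(d / 'a' / 'f').write_bytes(b'x' * 7)
print(get_size_stdlib(d))  # 7
```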
surge10