3

I'd like to write a Python command-line program that prints a directory tree (starting from a given directory) with the size of every subdirectory and its most frequent extensions. Here is an example of the output:

  • root_dir (5 GB, jpg (65 %): avi ( 30 %) : pdf (5 %))

-- aa (3 GB, jpg (100 %) )

-- bb (2 GB, avi (20 %) : pdf (2 %) )

--- bbb (1 GB, ...)

--- bb2 (1 GB, ...)

-- cc (1 GB, pdf (100 %) )

The format is:

nesting level, directory name (size of the directory with all its files and subdirectories, most frequent extensions with their size percentages in this directory).
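To make the format concrete, here is a minimal sketch of a formatter for one output line. The names `human_size` and `format_entry` are hypothetical, and the sketch assumes binary units (1 GB = 1024³ bytes), no dashes for the root, and `depth + 1` dashes for nested directories, as in the example above.

```python
def human_size(num_bytes):
    """Render a byte count with a binary unit, e.g. 3 * 1024**3 -> '3 GB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return "%.0f %s" % (size, unit)
        size /= 1024
    return "%.0f TB" % size


def format_entry(depth, name, size_bytes, ext_shares):
    """Build one tree line; ext_shares is a list of (extension, percent) pairs."""
    # Assumption: the root (depth 0) has no prefix, children get depth + 1 dashes
    prefix = "-" * (depth + 1) + " " if depth > 0 else ""
    exts = " : ".join("%s (%.0f %%)" % (ext, pct) for ext, pct in ext_shares)
    return "%s%s (%s, %s)" % (prefix, name, human_size(size_bytes), exts)


print(format_entry(1, "aa", 3 * 1024 ** 3, [("jpg", 100)]))
```

Computing the sizes and percentages is a separate problem; this only covers the presentation step.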

I have this code snippet so far. The problem is that it counts only the file sizes in a directory, so the resulting size is smaller than the real size of the directory. The other problem is how to put it all together to print the tree defined above without redundant computations.

xralf

4 Answers

4

Calculating directory sizes really isn't Python's strong suit, as explained in this post: very quickly getting total size of folder. If you have access to du and find, by all means use them. You can display the size of each directory with the following line:

find . -type d -exec du -hs "{}" \;

If you insist on doing this in Python, you may prefer post-order traversal over os.walk, as suggested by PabloG. But using os.walk can be visually cleaner, if efficiency is not the most important factor for you:

import os, sys
from collections import defaultdict

def walkIt(folder):
    for (path, dirs, files) in os.walk(folder):
        size = getDirSize(path)
        stats = getExtensionStats(files)

        # only get the top 3 extensions
        print('%s (%s, %s)' % (path, size, stats[:3]))

def getExtensionStats(files):
    # get all file extensions
    extensions = [f.rsplit(os.extsep, 1)[-1] 
        for f in files if len(f.rsplit(os.extsep, 1)) > 1]

    # count the extensions
    exCounter = defaultdict(int)
    for e in extensions:
        exCounter[e] += 1

    # convert count to percentage
    percentPairs = [(e, 100*ct/len(extensions)) for e, ct in exCounter.items()]

    # sort them, most frequent first, so that [:3] yields the top 3
    percentPairs.sort(key=lambda i: i[1], reverse=True)
    return percentPairs

def getDirSize(root):
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size +=  os.path.getsize( os.path.join( path, f ) )
    return size

if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    walkIt(path)
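The code above calls getDirSize for every directory, so each subtree is walked again for each of its ancestors. As a rough sketch of how the redundancy could be avoided, os.walk can be run bottom-up (`topdown=False`) so that each directory's total is built from the totals of its children. The function name `total_sizes` is hypothetical, and skipping symbolic links is an assumption of this sketch, not something the code above does.

```python
import os


def total_sizes(root):
    """Return {directory path: total bytes, including all subdirectories}."""
    totals = {}
    # topdown=False yields every subdirectory before its parent
    for path, dirs, files in os.walk(root, topdown=False):
        size = 0
        for name in files:
            full = os.path.join(path, name)
            if not os.path.islink(full):  # assumption: symlinks are not counted
                size += os.path.getsize(full)
        for name in dirs:
            # Subdirectories were already visited; symlinked ones are absent
            size += totals.get(os.path.join(path, name), 0)
        totals[path] = size
    return totals
```

Each file is measured once, and the resulting dictionary can then be consulted while printing the tree.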
William Niu
2

I personally find os.listdir + a recursive function better suited to this task than os.walk:

import os, copy
from os.path import join, getsize, isdir, splitext

frequent_ext = { ".jpg": 0, ".pdf": 0 }     # Frequent extensions

def list_dir(base_dir):
    dir_sz = 0  # directory size
    files = os.listdir(base_dir)
    ext_size = copy.copy(frequent_ext)

    for file_ in files:
        file_ = join(base_dir, file_)

        if isdir(file_):
            ret = list_dir(file_)
            dir_sz += ret[0]
            for k, v in frequent_ext.items():           # Add to freq.ext.sizes
                ext_size[k] += ret[1][k]
        else:
            file_sz = getsize(file_)
            dir_sz += file_sz

            ext = os.path.splitext(file_)[1].lower()   # Frequent extension?
            if ext in frequent_ext.keys():
                ext_size[ext] += file_sz

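    # Print the directory, its total size, and the share of each tracked extension
    shares = " ".join("%s: %5.2f%%" % (k, float(v) / max(1, dir_sz) * 100.)
                      for k, v in ext_size.items())
    print("%s %s %s" % (base_dir, dir_sz, shares))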

    return (dir_sz, ext_size)


base_dir = "e:/test_dir/"
base_dir = os.path.abspath(base_dir)
list_dir(base_dir)
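A variation on the same recursive idea, offered as a sketch rather than a replacement: it skips symbolic links (which may point to missing targets or form cycles), tracks every extension with a Counter instead of a fixed set, and collects the output lines so that each parent is listed before its children. The name `scan` and the line layout are assumptions made for illustration.

```python
import os
from collections import Counter


def scan(path, depth=0, lines=None):
    """Return (total_bytes, bytes_per_extension, lines) for path."""
    if lines is None:
        lines = []
    slot = len(lines)
    lines.append(None)  # reserve the parent's line so it precedes its children
    total, by_ext = 0, Counter()
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.islink(full):  # assumption: symlinks are ignored entirely
            continue
        if os.path.isdir(full):
            sub_total, sub_ext, _ = scan(full, depth + 1, lines)
            total += sub_total
            by_ext.update(sub_ext)
        else:
            size = os.path.getsize(full)
            total += size
            by_ext[os.path.splitext(name)[1].lower() or "(none)"] += size
    shares = ", ".join("%s %.0f%%" % (ext, 100.0 * size / max(total, 1))
                       for ext, size in by_ext.most_common(3))
    prefix = "-" * (depth + 1) + " " if depth else ""
    label = os.path.basename(os.path.normpath(path))
    lines[slot] = "%s%s (%d bytes, %s)" % (prefix, label, total, shares)
    return total, by_ext, lines
```

Calling `print("\n".join(scan(base_dir)[2]))` would print the whole tree, with each directory visited only once.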
PabloG
  • Thank you, your code teaches me good things in Python. I tried to use os.walk, but using it is quite complicated. Your recursion looks pretty elegant. I'm trying to test it, but it gives me an [error](http://pastebin.com/ytB9N7s1) – xralf Aug 25 '11 at 12:07
  • I noticed that the directory it crashed on is a symbolic link. Symbolic links could be skipped, if that's possible. I'm testing it on Linux (Ubuntu) now, but the main usage will be on Windows 7. – xralf Aug 25 '11 at 12:11
0

@Cldy is right: use os.path.

For example, os.path.walk (Python 2 only; os.walk is its replacement in Python 3) walks depth first through every directory below the argument and reports the files and folders in each directory.

Use os.path.getsize to get the sizes and os.path.splitext to get the extensions. Store the extensions in a list or dict and count them after going through each directory.
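As an illustration of that recipe, the sketch below walks a tree with os.walk, sums file sizes per extension in a Counter, and returns the largest shares as percentages of the total bytes. The function name `extension_shares` and the `top` parameter are hypothetical, and skipping symbolic links is an added assumption.

```python
import os
from collections import Counter


def extension_shares(root, top=3):
    """Return [(extension, percent of bytes)] for the largest extensions under root."""
    by_ext = Counter()
    for path, dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(path, name)
            if os.path.islink(full):  # assumption: symlinks are not counted
                continue
            ext = os.path.splitext(name)[1].lower()
            by_ext[ext] += os.path.getsize(full)
    total = sum(by_ext.values())
    if total == 0:
        return []
    return [(ext, 100.0 * size / total) for ext, size in by_ext.most_common(top)]
```

Percentages here are by size, as in the question's example, rather than by file count.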

If you are on Linux, I would suggest looking at du instead.

Adam Wagner
lifeisstillgood
-2

That's the module you need. And also this.

Mariy
  • Those would be the most useful modules. +1. Perhaps throw in a dict for keeping track of extensions and sizes. – foosion Aug 21 '11 at 13:06
  • 3
    If you just want to point someone to a module, and not say anything else about it, use a comment. If you're going to answer, at least point them to the specific functions, or give them an idea how to figure it out. (I'm not the downvoter, I'm out of votes for the day, but I agree with it). – agf Aug 21 '11 at 13:38
  • I pointed these modules because I think that they are good enough and the documentation is self-explanatory. Perhaps you are right that this would fit better as a comment. However the docs say everything so I won't change my answer. – Mariy Aug 21 '11 at 13:51