First off, "very high performance" and "Python" don't mix well. If you are looking to optimise performance to the extreme, switching to C will bring you benefits far superior to any smart code optimisation you might think of.
Secondly, it's hard to believe that the bottleneck in a "file-managing/analyzing toolkit" will be this function. I/O operations on disk are at least a few orders of magnitude slower than anything happening in memory. Profiling your code is the only accurate way to gauge this, but... I'm ready to pay you a pizza if I'm wrong! ;)
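If you want to check, the standard-library cProfile module makes this a two-liner. A minimal sketch (analyse_tree and the path are just placeholders for whatever your toolkit actually runs):

import cProfile
import pstats

def analyse_tree(path):
    # Placeholder: stand-in for whatever your toolkit does per file.
    pass

cProfile.run('analyse_tree("/some/path")', 'tree.prof')
# Show the ten most expensive calls by cumulative time; if I'm right,
# the I/O calls will dwarf any string formatting.
pstats.Stats('tree.prof').sort_stats('cumulative').print_stats(10)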
I built a silly test function just to perform some preliminary measurement:
from timeit import Timer as T

PLIST = [['dir', ['file', ['dir2', ['file2']], 'file3']], ['dir3', ['file4', 'file5', 'file6', 'file7']]]

def tree(plist, indent=0):
    # Recursively flatten the nested path list into indented display lines.
    level = []
    for el in plist:
        if isinstance(el, list):
            # A nested list is a subdirectory: recurse one level deeper.
            level.extend(tree(el, indent + 2))
        else:
            level.append(' ' * indent + el)
    return level

print(T(lambda: tree(PLIST)).repeat(number=100000))
This outputs the three timings that repeat() performs, each covering 100000 runs:
[1.0135619640350342, 1.0107290744781494, 1.0090651512145996]
Since the test path list contains 10 files and the number of iterations is 100000, this means that in about 1 second you can process a tree of roughly a million files. Now... unless you are working at Google, that seems an acceptable result to me.
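If you want to verify that the timing really scales linearly rather than extrapolating from 10 entries, you can generate a synthetic path list of about a million entries and time a single call. A sketch, assuming tree() from above is in scope (the layout built here is arbitrary):

from timeit import Timer as T

def make_plist(n_dirs, files_per_dir):
    # Build a flat list of directories, each holding a sub-list of files.
    plist = []
    for d in range(n_dirs):
        files = ['file%d_%d' % (d, f) for f in range(files_per_dir)]
        plist.append(['dir%d' % d, files])
    return plist

BIG = make_plist(10000, 100)  # 10000 * (1 + 100) ~= 1 million entries
print(T(lambda: tree(BIG)).repeat(repeat=3, number=1))

On the figures above, each of the three timings should come out around one second.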
By contrast, when I started writing this answer, I clicked on the "Properties" option on the root of my main 80 GB HD [this should give me the number of files on it, using C code]. A few minutes have gone by, and it is at around 50 GB, 300000 files...
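You can reproduce the same comparison in Python itself: counting files on disk with os.walk is dominated by I/O, not by the interpreter. A minimal sketch (point root at any large directory; '/' is just an example):

import os
import time

def count_files(root):
    # Walk the directory tree and count regular files; nearly all the
    # time spent here is disk I/O, not Python bytecode.
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

start = time.time()
n = count_files('/')
print('%d files in %.1f seconds' % (n, time.time() - start))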
HTH! :)