
I am trying to get all grandchildren of a certain directory in Python. For performance reasons I don't want to keep calling OS functions in a loop (it's a network filesystem). This is what I have at the moment. Is there a simpler way to do this?

import os

dirTree = os.walk(root)
children = [os.path.join(root, x) for x in next(dirTree)[1]]
grandChildren = []
for root, dirs, files in dirTree:
    if root in children:
        for dir in dirs:
            grandChildren.append(os.path.join(root, dir))

EDIT: I'm not clear on whether my call to os.walk is lazy or not. My intention is that the whole tree should be in memory after my call but I'm not sure about it.
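To the EDIT: os.walk() is lazy — it returns a generator and does no filesystem work until you iterate it. A minimal sketch (using a throwaway temporary directory as a stand-in for a real tree) showing both the lazy behaviour and how to pull the whole tree into memory:

```python
import os
import tempfile

# Build a tiny throwaway tree: root/child/grandchild
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "child", "grandchild"))

walker = os.walk(root)   # no I/O has happened yet; this is just a generator
first = next(walker)     # reads only the top-level directory
print(first[1])          # ['child']

# To materialize the entire tree in memory up front:
tree = [first] + list(walker)
print(len(tree))         # 3 directories visited in total
```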

jjujuma

2 Answers


If I understood your question correctly, you can use glob to match files or directories with wildcard patterns. For example, to get all grandchild directories of "/home/" as a list:

import glob
glob.glob('/home/*/*/')

Or, to include regular files as well:

glob.glob('/home/*/*')
vikalp.sahni
  • This is not very useful, as you would have to know the number of subdirectories in each directory. – msvalkon Mar 06 '13 at 20:21
  • jjujuma: `grandChildren = [dirpath.rstrip(os.sep) for dirpath in glob.iglob('/home/*/*/')]` produces the same list as your code. The `rstrip()` removes the trailing path separator on directory paths in the list. @msvalkon: I believe you're mistaken. – martineau Mar 06 '13 at 22:46
  • @martineau ah yes I understood `grandChildren` as all the subdirectories of `/foo/bar`. – msvalkon Mar 06 '13 at 22:52
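To illustrate the point made in the comments, here is a minimal sketch (using a throwaway temporary tree, not the asker's actual paths) showing that a trailing separator in a glob pattern matches directories only:

```python
import glob
import os
import tempfile

# Build a throwaway tree: root/a/b (a directory) and root/a/file.txt (a file).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
open(os.path.join(root, "a", "file.txt"), "w").close()

# A trailing separator restricts the match to directories.
dirs_only = glob.glob(os.path.join(root, "*", "*") + os.sep)
everything = glob.glob(os.path.join(root, "*", "*"))

print(dirs_only)    # only root/a/b, with a trailing separator
print(everything)   # root/a/b and root/a/file.txt
```

As the comment above notes, rstrip(os.sep) on each result removes the trailing separator if you need bare paths.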

Neither on POSIX nor on Windows can you get all of that data in one OS call. At a minimum, on POSIX, there will be three calls per directory (opendir, readdir, closedir), plus one per directory entry (stat).


I believe that what follows results in fewer OS calls than what you posted. Yes, the os.walk() call is lazy: the entire tree is not in memory when walk() returns; it is read in piecemeal as you iterate, i.e. on each call to next().

Thus, my version reads only the first-level descendant directories, and stats only the immediate children and grandchildren. Your version does that work for all of the great-grandchildren as well, however deep your directory structure goes.

import os

root = '.'
grandChildren = []
for kid in next(os.walk(root))[1]:
    x = next(os.walk(os.path.join(root, kid)))
    for grandKid in x[1]:  # (or x[1] + x[2] if you care about regular files)
        grandChildren.append(os.path.join(x[0], grandKid))

Or, as a list comprehension instead of a for loop:

import os
root = '.'
grandChildren = [
    os.path.join(root, kid, grandKid)
    for kid in next(os.walk(root))[1]
    for grandKid in next(os.walk(os.path.join(root, kid)))[1]]

Finally, factoring out the os.walks into a function:

import os

def read_subdirs(dir='.'):
    return (os.path.join(dir, x) for x in next(os.walk(dir))[1])

root = '.'
grandChildren = [
    grandKid
    for kid in read_subdirs(root)
    for grandKid in read_subdirs(kid)]
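As an aside (not part of the original answer): on Python 3.5+, os.scandir() — which os.walk() uses internally — exposes the file-type information that the directory listing itself provides, so on most platforms the per-entry stat calls can be skipped entirely. A sketch of the same two-level listing:

```python
import os

def grandchild_dirs(root="."):
    """Two-level directory listing via os.scandir (Python 3.5+).

    On most platforms entry.is_dir() uses the type reported by the
    directory listing itself, avoiding a separate stat per entry.
    """
    result = []
    with os.scandir(root) as kids:
        for kid in kids:
            if kid.is_dir():
                with os.scandir(kid.path) as grandkids:
                    result.extend(g.path for g in grandkids if g.is_dir())
    return result
```

Note that is_dir() may still stat symlinks (it follows them by default), so on a tree full of symlinks the saving is smaller.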


From testing, we can see that my version calls stat many fewer times than yours does if there are great-grandchildren.

In my home directory, for example, I ran my code (/tmp/a.py) and yours (/tmp/b.py) with root set to '.' in each case:

$ strace -e stat python /tmp/a.py 2>&1 > /dev/null | egrep -c stat
1245
$ strace -e stat python /tmp/b.py 2>&1 > /dev/null | egrep -c stat
36049
Robᵩ