
I am trying to get all grandchildren of a certain directory in Python. For performance reasons I don't want to keep calling OS functions in a loop (it's a network filesystem). This is what I have at the moment. Is there a simpler way to do this?

import os

dirTree = os.walk(root)
children = [os.path.join(root, x) for x in next(dirTree)[1]]
grandChildren = []
for root, dirs, files in dirTree:
    if root in children:
        for dir in dirs:
            grandChildren.append(os.path.join(root, dir))

EDIT: I'm not clear on whether my call to os.walk is lazy or not. My intention is that the whole tree should be in memory after my call but I'm not sure about it.
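To the EDIT: os.walk() is lazy — it returns a generator and does no filesystem work until you iterate it. A minimal sketch (using a throwaway temporary directory as a stand-in for a real tree) showing both the lazy behaviour and how to pull the whole tree into memory:

```python
import os
import tempfile

# Build a tiny throwaway tree: root/child/grandchild
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "child", "grandchild"))

walker = os.walk(root)   # no I/O has happened yet; this is just a generator
first = next(walker)     # reads only the top-level directory
print(first[1])          # ['child']

# To materialize the entire tree in memory up front:
tree = [first] + list(walker)
print(len(tree))         # 3 directories visited in total
```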

jjujuma

2 Answers


If I understood your question correctly, you can use glob to match files or directories with wildcard patterns. For example, to get all grandchild directories of "/home/" as a list:

import glob
glob.glob('/home/*/*/')

Or, to include regular files as well:

glob.glob('/home/*/*')
vikalp.sahni
  • This is not very useful, as you would have to know the number of subdirectories in each directory. – msvalkon Mar 06 '13 at 20:21
  • jjujuma: `grandChildren = [dirpath.rstrip(os.sep) for dirpath in glob.iglob('/home/*/*/')]` produces the same list as your code. The `rstrip()` removes the trailing path separator on directory paths in the list. @msvalkon: I believe you're mistaken. – martineau Mar 06 '13 at 22:46
  • @martineau ah yes I understood `grandChildren` as all the subdirectories of `/foo/bar`. – msvalkon Mar 06 '13 at 22:52
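To illustrate the point made in the comments, here is a minimal sketch (using a throwaway temporary tree, not the asker's actual paths) showing that a trailing separator in a glob pattern matches directories only:

```python
import glob
import os
import tempfile

# Build a throwaway tree: root/a/b (a directory) and root/a/file.txt (a file).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
open(os.path.join(root, "a", "file.txt"), "w").close()

# A trailing separator restricts the match to directories.
dirs_only = glob.glob(os.path.join(root, "*", "*") + os.sep)
everything = glob.glob(os.path.join(root, "*", "*"))

print(dirs_only)    # only root/a/b, with a trailing separator
print(everything)   # root/a/b and root/a/file.txt
```

As the comment above notes, rstrip(os.sep) on each result removes the trailing separator if you need bare paths.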

Neither on POSIX nor on Windows can you get all of that data in one OS call. At a minimum, on POSIX, there will be three calls per directory (opendir, readdir, closedir), plus one per directory entry (stat).


I believe that what follows results in fewer OS calls than what you posted. Yes, the os.walk() call is lazy: the entire tree is not in memory when walk() returns; it is read in piecemeal as you iterate, i.e. on each call to next().

Thus, my version reads only the first-level descendant directories, and stats only the immediate children and grandchildren. Your version does that work for all of the great-grandchildren as well, however deep your directory structure goes.

import os

root = '.'
grandChildren = []
for kid in next(os.walk(root))[1]:
    x = next(os.walk(os.path.join(root, kid)))
    for grandKid in x[1]:  # (or x[1] + x[2] if you care about regular files)
        grandChildren.append(os.path.join(x[0], grandKid))

Or, as a list comprehension instead of a for loop:

import os
root = '.'
grandChildren = [
    os.path.join(root, kid, grandKid)
    for kid in next(os.walk(root))[1]
    for grandKid in next(os.walk(os.path.join(root, kid)))[1]]

Finally, factoring out the os.walks into a function:

import os

def read_subdirs(dir='.'):
    return (os.path.join(dir, x) for x in next(os.walk(dir))[1])

root = '.'
grandChildren = [
    grandKid
    for kid in read_subdirs(root)
    for grandKid in read_subdirs(kid)]
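As an aside (not part of the original answer): on Python 3.5+, os.scandir() — which os.walk() uses internally — exposes the file-type information that the directory listing itself provides, so on most platforms the per-entry stat calls can be skipped entirely. A sketch of the same two-level listing:

```python
import os

def grandchild_dirs(root="."):
    """Two-level directory listing via os.scandir (Python 3.5+).

    On most platforms entry.is_dir() uses the type reported by the
    directory listing itself, avoiding a separate stat per entry.
    """
    result = []
    with os.scandir(root) as kids:
        for kid in kids:
            if kid.is_dir():
                with os.scandir(kid.path) as grandkids:
                    result.extend(g.path for g in grandkids if g.is_dir())
    return result
```

Note that is_dir() may still stat symlinks (it follows them by default), so on a tree full of symlinks the saving is smaller.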


From testing, we can see that my version calls stat many fewer times than yours does if there are great-grandchildren.

In my home directory, for example, I ran my code (/tmp/a.py) and yours (/tmp/b.py) with root set to '.' in each case:

$ strace -e stat python /tmp/a.py 2>&1 > /dev/null | egrep -c stat
1245
$ strace -e stat python /tmp/b.py 2>&1 > /dev/null | egrep -c stat
36049
Robᵩ