I am using os.walk to search a directory tree for directories whose names contain any of the strings in a list, my_list.

Directory tree:

./user/zebra/
./user/zebra/zebra_01/
./user/zebra/zebra_02/
./user/lion/
./user/lion/lion_01/
./user/lion/lion_01/giraffe_02
./user/giraffe/
./user/giraffe/giraffe_01

my_list = ['zebra', 'giraffe']

My script:

for dirpath, dirnames, filenames in os.walk(<path_to_directory_tree>, topdown=True):
    for folders in dirnames:
        for x in my_list:
            if x in folders:
                source_paths = os.path.join(dirpath, folders)

Output (i.e. print(source_paths)):

./user/zebra/
./user/zebra/zebra_01/
./user/zebra/zebra_02/
./user/lion/lion_01/giraffe_02/
./user/giraffe/
./user/giraffe/giraffe_01

I can then further process this output to retain only the desired paths:

./user/zebra/
./user/lion/lion_01/giraffe_02/
./user/giraffe/

But with a massive directory tree, this method takes a very long time. Therefore, I want to avoid generating and then filtering the initial output by having os.walk stop searching recursively for “my_list” directories once there is a parent path match, such that only the desired path output is generated.

I have seen dirnames[:] = [] used, but this would retain only ./user/giraffe/ (but not ./user/zebra/)

Bot75
  • Your desired output is unclear. Can you explain the criteria for it? Why is `./user/lion/lion_01/giraffe_02/` in the desired output? – Rahul K P Oct 12 '22 at 21:59
  • Sure, sorry about that. I do not want os.walk to generate any path that contains multiple instances of strings in my_list (including multiples of the same string), such that the output is all paths that contain only one instance of any string found in my_list – Bot75 Oct 12 '22 at 22:06

1 Answer

You can count how many times the elements of my_list occur in each path and filter accordingly:

result = []
my_list = ['zebra', 'giraffe']
for dirpath, dirnames, filenames in os.walk(<path_to_directory_tree>, topdown=True):
    for folders in dirnames:
        source_path = os.path.join(dirpath, folders)
        if sum(source_path.count(x) for x in my_list) == 1:
            result.append(source_path)

Working demo with a list of files.

In [40]: l
Out[40]: 
['./user/zebra/',
 './user/zebra/zebra_01/',
 './user/zebra/zebra_02/',
 './user/lion/lion_01/giraffe_02/',
 './user/giraffe/',
 './user/giraffe/giraffe_01']

In [41]: for i in l:
    ...:     if sum(i.count(x) for x in my_list) == 1:
    ...:         print(i)
    ...: 
./user/zebra/
./user/lion/lion_01/giraffe_02/
./user/giraffe/

Edit:

for dirpath, dirnames, filenames in os.walk(<path_to_directory_tree>, topdown=True):
    dirnames[:] = [d for d in dirnames if sum(d.count(x) for x in my_list) > 1]
    for folders in dirnames:
        source_path = os.path.join(dirpath, folders)
        result.append(source_path)

The line dirnames[:] = [d for d in dirnames if sum(d.count(x) for x in my_list) > 1] is intended to exclude directories with multiple occurrences of my_list strings from the os.walk traversal. Reference
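As a comment on this answer points out, the `> 1` filter returns an empty list for the example tree, so the walk stops at the first level. Below is a minimal sketch of one way to get the pruning behaviour the question asks for: record a directory as soon as its name matches, and leave it out of `dirnames` so that os.walk never descends into it. The function name `find_first_matches` and its parameters are hypothetical, chosen only for illustration, and the sketch assumes that matching on the directory name alone (not the full path) is what is wanted.

```python
import os


def find_first_matches(root, keywords):
    """Collect directories under root whose name contains any keyword,
    without descending into a directory once it has matched.

    Hypothetical helper for illustration; not part of os or the original answer.
    """
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root, topdown=True):
        keep = []
        for name in dirnames:
            if any(k in name for k in keywords):
                # Matched: record the path and do not search below it.
                matches.append(os.path.join(dirpath, name))
            else:
                # No match yet: keep walking into this directory.
                keep.append(name)
        # Slice assignment mutates the list that os.walk itself holds, so
        # only the directories in keep are visited (requires topdown=True).
        dirnames[:] = keep
    return matches
```

For the tree in the question, calling `find_first_matches("./user", ["zebra", "giraffe"])` should yield the zebra, giraffe, and lion/lion_01/giraffe_02 directories, in whatever order the file system lists them. If a single directory name could contain several keywords and such names should be skipped, the `any(...)` test could be swapped for the `sum(name.count(k) for k in keywords) == 1` check used earlier in this answer.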

Rahul K P
  • This does generate the desired output, but I am looking for a way to prevent os.walk from doing an exhaustive recursive search of my directory tree. So, for example, instead of finding ./user/giraffe/giraffe_01 and removing it later, I want os.walk to find ./user/giraffe/ and then stop querying subdirectories of ./user/giraffe/ while still querying other paths until all paths with one (and only one) list element is identified (so I do not want to generate the list of paths containing one or more elements in the list) – Bot75 Oct 12 '22 at 22:31
  • dirnames[:] = [d for d in dirnames if sum(d.count(x) for x in my_list) > 1] returns an empty list and ==1 appears to capture only the dirnames with exact matches to elements in the list. So ./user/lion/lion_01/giraffe_02/ is excluded with my_list = ["giraffe", "zebra"], and (for example) ./user/zebra/zebra_01/ is retained – Bot75 Oct 13 '22 at 16:16