
I have to loop through a large number of folders (around 60,000) at a given path (My_Path). Then, in each folder in My_Path, I have to check whether a file name contains a certain date. I want to know if there is a faster way than looping through them one by one with the os library (it currently takes around one hour).

My_Path:
- Folder 1
  o File 1
  o File 2
  o File 3
  o …
- Folder 2
- Folder 3
- …
- Folder 60 000

import os

My_Path = r'\\...\...\...\...'
mylist2 = os.listdir(My_Path)  # gives a list of ~60,000 folder names
for folder in mylist2:
    # list all files inside each folder
    mylist = os.listdir(os.path.join(My_Path, folder))
    for file in mylist:
        Check_Function(file)

The actual run takes around one hour, and I want to know if there is a faster solution.
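For reference, a minimal sketch of what such a check might look like (the actual Check_Function and date format are not shown here, so this is only an illustration):

def Check_Function(file_name, date_str="2019-09-03"):
    # Illustration only: True if the file name contains the given date string.
    return date_str in file_name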

Thanks !!

Adirio
ssc
  • Python says that variable and function names should not be capitalized, please use `my_path` and `check_function` instead. Capital letters are reserved for class names. – Adirio Sep 03 '19 at 14:18
  • https://stackoverflow.com/a/10378046/1622937 – j-i-l Sep 03 '19 at 14:18
  • Possible duplicate of [How can I iterate over files in a given directory?](https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory) – j-i-l Sep 03 '19 at 14:19
  • Look at this one: https://stackoverflow.com/questions/3162002/a-faster-way-of-directory-walking-instead-of-os-listdir – fukanchik Sep 03 '19 at 14:19

2 Answers


Try os.walk(); it might be faster:

import os

My_Path = r'\\...\...\...\...'
# os.walk traverses the whole tree rooted at My_Path in a single pass
for path, dirs, files in os.walk(My_Path):
    for file in files:
        Check_Function(os.path.join(path, file))

If it is not, maybe it's your Check_Function eating up the cycles.
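A quick way to tell is to time the directory traversal separately from the per-file checks; a rough sketch, assuming My_Path and Check_Function as defined in the question:

import os
import time

t0 = time.perf_counter()
# Collect the directory listing first ...
listing = [(path, files) for path, dirs, files in os.walk(My_Path)]
t1 = time.perf_counter()
# ... then run the per-file checks on their own.
for path, files in listing:
    for file in files:
        Check_Function(os.path.join(path, file))
t2 = time.perf_counter()
print(f"walking: {t1 - t0:.1f} s, checking: {t2 - t1:.1f} s")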

ilmiacs
  • `walk` uses [`scandir`](https://www.python.org/dev/peps/pep-0471/) on recent versions of Python, which (AFAICT) should be significantly faster on network drives under Windows – Sam Mason Sep 03 '19 at 15:09
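As the comment above notes, os.scandir() is what powers os.walk() on recent Python versions, and it can also be used directly. A minimal sketch for the one-level folder layout described in the question (My_Path and Check_Function assumed as above):

import os

with os.scandir(My_Path) as folders:
    for folder in folders:
        if folder.is_dir():
            # DirEntry objects usually carry file-type info from the listing
            # itself, which avoids extra stat calls per entry
            with os.scandir(folder.path) as entries:
                for entry in entries:
                    if entry.is_file():
                        Check_Function(entry.name)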

As others have already suggested, you can obtain the list lst of all files as in this answer. Then you can run your function over it in parallel with multiprocessing.

import multiprocessing as mp


def parallelize(fun, vec, cores):
    # Map fun over vec using a pool of `cores` worker processes.
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

and run

res = parallelize(Check_Function, lst, mp.cpu_count()-1)

Update: given that I don't think Check_Function is CPU-bound, you can use even more cores.
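One caveat worth adding: on Windows (which the UNC-style path in the question suggests), multiprocessing spawns fresh interpreter processes, so the pool must be created under an if __name__ == "__main__" guard. A minimal end-to-end sketch, assuming Check_Function is defined at module level as in the question:

import multiprocessing as mp
import os

My_Path = r'\\...\...\...\...'

def parallelize(fun, vec, cores):
    # Map fun over vec using a pool of `cores` worker processes.
    with mp.Pool(cores) as p:
        return p.map(fun, vec)

if __name__ == "__main__":
    # Build the flat list of file paths first, then run the checks in parallel.
    # (Use entry names instead of full paths if Check_Function expects only
    # the bare file name.)
    lst = [os.path.join(path, file)
           for path, dirs, files in os.walk(My_Path)
           for file in files]
    res = parallelize(Check_Function, lst, mp.cpu_count() - 1)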

Dharman
rpanai
  • @sam Please post an alternative answer of your own. This answer is not a community wiki answer and you shouldn't be making such serious changes. – Dharman Apr 04 '22 at 20:22