I need to read every file in a directory tree, starting from a given root. I would like to do this as fast as possible using parallelism. I have 48 cores and 1 TB of RAM at my disposal, so thread and memory resources are not a constraint. I also need to log every file that was read.
I looked at joblib, but I could not work out how to combine it with os.walk.
I can think of two ways:
- Walk the tree first, add all files to a queue or list, and have a pool of worker threads dequeue files. Best load balancing, but possibly more time overall because of the initial walk and the queue overhead.
- Spawn threads and statically assign portions of the tree to each thread, for example by assigning directories based on a hash of some sort. Poor load balancing, but no initial walk.
Or is there a better way?
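To make the two options concrete, here is a rough sketch of what I have in mind, using only the standard library. The names read_file, iter_files, walk_then_pool and static_partition are placeholders I made up, and hashing the top-level directory name with zlib.crc32 is just one assumed way to do the static assignment. Nothing here is tuned or benchmarked.

```python
import logging
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("treeread")  # hypothetical logger name


def read_file(path):
    """Read one file fully, log it, and return (path, byte count)."""
    with open(path, "rb") as f:
        size = len(f.read())
    log.info("read %s (%d bytes)", path, size)
    return path, size


def iter_files(root):
    """Yield every file path under root using a single-threaded os.walk."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def walk_then_pool(root, workers=48):
    """Option 1: one walker feeds a shared pool; idle workers take the next file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, iter_files(root)))


def static_partition(root, workers=48):
    """Option 2: hash each top-level directory to a worker; each worker walks its share."""
    buckets = [[] for _ in range(workers)]
    top_files = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Assumed hash: CRC32 of the directory name, modulo worker count.
                idx = zlib.crc32(os.fsencode(entry.name)) % workers
                buckets[idx].append(entry.path)
            elif entry.is_file():
                top_files.append(entry.path)

    def work(dirs):
        # Each worker walks and reads only the directories assigned to it.
        return [read_file(p) for d in dirs for p in iter_files(d)]

    # Files directly under root are read up front by the calling thread.
    results = [read_file(p) for p in top_files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(work, buckets):
            results.extend(chunk)
    return results
```

Calling logging.basicConfig(level=logging.INFO) before either function would print one log line per file. In the second version, a single huge top-level directory ends up on one thread, which is the load-balancing problem I am worried about.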
EDIT: Storage performance is not a concern. Assume infinitely fast storage that can handle an unlimited number of parallel reads.
EDIT: Removed the multi-node scenario to keep the focus on the parallel directory walk.