I am trying to read and process thousands of files, but unfortunately it takes about 3x as long to process a file as it does to read it in from disk, so I would like to process these files as they are read in (and while I continue reading in additional files).
In a perfect world, I have a generator that reads one file at a time, and I would like to pass this generator to a pool of workers that process items from the generator as they are (slowly) generated.
Here's an example:
from multiprocessing import Pool
import os

def process_file(file_string):
    ...
    return processed_file

pool = Pool(processes=4)
path = 'some/path/'
# The generator expression reads each file's full contents.
results = pool.map(process_file,
                   (open(path + part, 'rb').read() for part in os.listdir(path)))
The only issue with the code above is that pool.map consumes the entire iterable before the workers start, which means all the files are read into memory before any processing begins: I have to wait for the disk to read everything in, and I also consume a large amount of memory.
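For reference, here is a minimal sketch of the lazy behaviour I am after, using pool.imap in place of pool.map (the read_files helper and the trivial body of process_file are just placeholders I made up for illustration). I am not certain whether imap still drains the generator faster than the workers can keep up, which is essentially what I am asking:

from multiprocessing import Pool
import os

def process_file(file_string):
    # Placeholder for the real (slow) processing.
    return len(file_string)

def read_files(path):
    # Lazily yield one file's contents at a time.
    for part in os.listdir(path):
        with open(path + part, 'rb') as f:
            yield f.read()

if __name__ == '__main__':
    path = 'some/path/'
    with Pool(processes=4) as pool:
        # Unlike map, imap yields each result as workers finish,
        # rather than materializing the whole iterable first.
        for result in pool.imap(process_file, read_files(path)):
            print(result)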