Say I need to read a large array from disk and do some read-only work on it.
I need to use multiprocessing, but sharing the data across processes with multiprocessing.Manager() or Array() is way too slow. Since my operation on this data is read-only, according to this answer, I can declare it in the global scope, and then each child process gets its own copy of the data in memory:
# main.py
import argparse
import multiprocessing as mp
import time

import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--path', type=str)
args = parser.parse_args()

print('loading data from disk... may take a long time...')
global_large_data = np.load(args.path)


def worker(row_id):
    # some read-only work on global_large_data
    time.sleep(0.01)
    print(row_id, np.sum(global_large_data[row_id]))


def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(worker, range(global_large_data.shape[0]))
    pool.close()
    pool.join()


if __name__ == '__main__':
    main()
And in the terminal:
$ python3 main.py -p /path/to/large_data.npy
This is fast and almost works for me. However, one shortcoming is that each child process reloads the large file from disk, and the loading wastes a lot of time.
Is there any way (e.g., a wrapper) so that only the parent process loads the file from disk once, and then sends a copy directly into each child process's memory?
Note that memory is abundant -- having many copies of this data in memory is fine. I just don't want to reload it from disk multiple times.
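For example, I imagine something like the following rough sketch, where the parent loads the array once and passes it to each worker through Pool's initializer/initargs (the names init_worker and _large_data are just my own placeholders, and I haven't tested this):

# sketch of what I'm imagining (untested): the parent loads the array once,
# and Pool's initializer/initargs copies it into each child's memory,
# so no child touches the disk.
import argparse
import multiprocessing as mp
import time

import numpy as np

_large_data = None  # per-process copy, filled in by the initializer


def init_worker(data):
    # runs once in each child process; receives the parent's copy
    global _large_data
    _large_data = data


def worker(row_id):
    time.sleep(0.01)
    print(row_id, np.sum(_large_data[row_id]))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--path', type=str)
    args = parser.parse_args()

    print('loading data from disk... only once, in the parent...')
    large_data = np.load(args.path)

    with mp.Pool(mp.cpu_count(), initializer=init_worker,
                 initargs=(large_data,)) as pool:
        pool.map(worker, range(large_data.shape[0]))


if __name__ == '__main__':
    main()

Is this (or something like it) a reasonable way to do it, or is there a better-established pattern?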