I am writing a feature-extraction program in Python 2.7 that extracts information from over 20,000 files. I store some pre-computed information in a dictionary. To make my program faster, I use multiprocessing, and the big dictionary must be accessible in those processes. (Actually, I just want each process to handle a part of the files; every process runs the same function.) The dictionary is never modified in the processes; it is just a parameter of the function.
I found that each process creates its own address space, each with a copy of this big dictionary. My computer does not have enough memory to hold that many copies of it. Is there a way to create a static dict object that can be shared by every process? Below is the multiprocessing part of my code; pmi_dic is the big dictionary (possibly several GB), and it is not changed inside get_features.
import multiprocessing as mp

processNum = 2
pool = mp.Pool(processes=processNum)
fileNum = len(filelist)
offset = fileNum / processNum  # integer division in Python 2
for i in range(processNum):
    start = i * offset
    if i == processNum - 1:
        end = fileNum  # the last process also takes the leftover files
    else:
        end = start + offset
    print str(start) + ' ' + str(end)
    # each task receives its own pickled copy of pmi_dic through args
    pool.apply_async(get_features, args=(df_dic, pmi_dic, start, end, filelist, wordNum))
pool.close()
pool.join()
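To make concrete what I mean by a "static" dict: the only workaround I can think of is making the dictionary a module-level global that exists before the Pool is created, so that (on Linux at least) the forked workers should read the parent's memory pages copy-on-write instead of each receiving a pickled copy through args. Here is a minimal sketch of that idea; the tiny dictionary and the body of get_features are hypothetical stand-ins for my real data and function. Is this the right approach, or is something like mp.Manager().dict() the intended tool for this?

import multiprocessing as mp

# Hypothetical stand-in for the real several-GB dictionary; it is created at
# module level, BEFORE the Pool, so forked workers inherit it copy-on-write.
pmi_dic = {'example_word': 0.5}

def get_features(start, end):
    # reads the module-level pmi_dic instead of receiving it through args
    return pmi_dic.get('example_word', 0.0) * (end - start)

if __name__ == '__main__':
    pool = mp.Pool(processes=2)  # fork happens here, after pmi_dic exists
    results = [pool.apply_async(get_features, args=(i * 10, (i + 1) * 10))
               for i in range(2)]
    pool.close()
    pool.join()
    print [r.get() for r in results]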