
I am writing a feature-collection program in Python 2.7 that extracts information from over 20,000 files. I store some pre-computed information in a dictionary. In order to make my program faster I use multiprocessing, and this big dictionary must be available in every process. (Each process runs the same function, just on a different subset of the files.) The dictionary is never modified inside these processes; it is only a parameter to the function.

I found that each process gets its own address space, each with its own copy of this big dictionary. My computer does not have enough memory to hold that many copies of it. Is there a way to create a single, static dict object that every process can use? Below is the multiprocessing part of my code; pmi_dic is the big dictionary (possibly several GB), and it is not modified inside get_features.

import multiprocessing as mp

processNum = 2
pool = mp.Pool(processes = processNum)
fileNum = len(filelist)
offset = fileNum / processNum  # integer division in Python 2.7
for i in range(processNum):
    start = i * offset
    # the last process also takes any leftover files
    if (i == processNum - 1):
        end = fileNum
    else:
        end = start + offset

    print str(start) + ' ' + str(end)
    pool.apply_async(get_features, args = (df_dic, pmi_dic, start, end, filelist, wordNum))

pool.close()
pool.join()
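
To make the question concrete, here is roughly the layout I was hoping is possible: load pmi_dic once at module level in the parent process and let the forked workers read it directly instead of passing it through apply_async. This is only a sketch of the idea (load_pmi_dic, the file list and the body of get_features are placeholders, and it assumes a Unix fork-based start); I am not sure whether copy-on-write actually avoids the extra memory in practice:

import multiprocessing as mp

def load_pmi_dic():
    # placeholder for however the real dictionary is pre-computed / loaded
    return {('word_a', 'word_b'): 1.23}

# Loaded once in the parent process; workers forked from it can read it
# without each receiving (and pickling) its own copy as an argument.
pmi_dic = load_pmi_dic()

def get_features(start, end, filelist):
    for path in filelist[start:end]:
        # placeholder: extract features from `path` using the shared pmi_dic
        print path, len(pmi_dic)

if __name__ == '__main__':
    filelist = ['file%d.txt' % n for n in range(10)]  # placeholder file list
    processNum = 2
    pool = mp.Pool(processes = processNum)
    offset = len(filelist) / processNum
    for i in range(processNum):
        start = i * offset
        end = len(filelist) if i == processNum - 1 else start + offset
        pool.apply_async(get_features, args = (start, end, filelist))
    pool.close()
    pool.join()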
  • Sorry for the error in pasting the code – Kaiyu Wang May 02 '16 at 02:28
  • You need a database. – jussij May 02 '16 at 02:29
  • See http://stackoverflow.com/questions/659865/python-multiprocessing-sharing-a-large-read-only-object-between-processes and http://stackoverflow.com/questions/17785275/share-large-read-only-numpy-array-between-multiprocessing-processes and http://stackoverflow.com/a/28503600/4323 – John Zwinck May 02 '16 at 02:59
