I have a multiprocessing program that parses XML files and returns one dictionary per file (a word count); I then merge all of those dictionaries into a single final_dword.
import multiprocessing as mp

# parse_xml(path) returns a word-count dict for one file;
# locate(pattern) yields matching file paths (both defined elsewhere in my module).
if __name__ == '__main__':
    numthreads = 2
    pool = mp.Pool(processes=numthreads)
    dword_list = pool.map(parse_xml, locate("*.xml"))  # one dict per XML file
    final_dword = {}
    print "The final Word Count dictionary is "
    map(final_dword.update, dword_list)  # merge all per-file dicts into one
    print final_dword
The above code works perfectly fine for smaller data sets. As my data size grows, the program freezes, and my assumption is that it freezes during the execution of

map(final_dword.update, dword_list)

I tried to profile my code with muppy and found the following.
At iteration n (where n > 1200, i.e. the program has already processed around 1200+ files), I get the following stats:
Iteration 1259
                       types |   # objects |   total size
============================ | =========== | ============
                        dict |         660 |    511.03 KB
                         str |        6899 |    469.10 KB
                        code |        1979 |    139.15 KB
                        type |         176 |     77.00 KB
          wrapper_descriptor |        1037 |     36.46 KB
                        list |         307 |     23.41 KB
  builtin_function_or_method |         738 |     23.06 KB
           method_descriptor |         681 |     21.28 KB
                     weakref |         434 |     16.95 KB
                       tuple |         476 |     15.76 KB
                         set |         122 |     15.34 KB
         <class 'abc.ABCMeta |          18 |      7.88 KB
         function (__init__) |         130 |      7.11 KB
           member_descriptor |         226 |      7.06 KB
           getset_descriptor |         213 |      6.66 KB
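For reference, the summaries above are produced roughly like the sketch below. I am using muppy through pympler; the exact place where this hooks into my parsing loop differs in my real code, and check_memory is just an illustrative name:

from pympler import muppy, summary

def check_memory(iteration):
    # snapshot every object the garbage collector can see
    all_objects = muppy.get_objects()
    # group by type, with object counts and total sizes
    stats = summary.summarize(all_objects)
    print "Iteration %d" % iteration
    summary.print_(stats)  # prints the types / # objects / total size table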
I have 4 GB of RAM on my laptop and I am processing a huge number of small (< 1 MB) XML files. I am looking for a better way to merge the smaller dictionaries.
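One idea I have been considering is to merge incrementally as results come in, rather than letting pool.map build the whole dword_list first, so that only one per-file dict is held alongside final_dword at a time. A rough sketch (reusing my parse_xml and locate from above, keeping the same dict.update semantics as my current code; the chunksize value is just a guess):

import multiprocessing as mp

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    final_dword = {}
    # imap_unordered yields each per-file dict as soon as a worker finishes it
    for dword in pool.imap_unordered(parse_xml, locate("*.xml"), chunksize=50):
        final_dword.update(dword)
    pool.close()
    pool.join()
    print "The final Word Count dictionary is "
    print final_dword

Would something along these lines actually avoid the freeze, or is there a fundamentally better way to combine the dictionaries?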