
I have a multiprocessing program which parses some XML information and returns a dictionary (one dictionary object per file) as output; I then merge all the dictionaries into one final_dword.

import multiprocessing as mp

if __name__ == '__main__':
    numthreads = 2
    pool = mp.Pool(processes=numthreads)
    dword_list = pool.map(parse_xml, locate("*.xml"))
    final_dword = {}
    print "The final Word Count dictionary is "
    map(final_dword.update, dword_list)
    print final_dword

The above code works perfectly fine for smaller data sets. As my data size grows, my program freezes during

map(final_dword.update,dword_list)

My assumption is that the program freezes during execution of the above statement. I tried to profile my code using muppy and found the following.

At iteration n (where n > 1200, meaning the program has processed roughly 1200+ files), I get the following stats:

Iteration  1259
                       types |   # objects |   total size
============================ | =========== | ============
                        dict |         660 |    511.03 KB
                         str |        6899 |    469.10 KB
                        code |        1979 |    139.15 KB
                        type |         176 |     77.00 KB
          wrapper_descriptor |        1037 |     36.46 KB
                        list |         307 |     23.41 KB
  builtin_function_or_method |         738 |     23.06 KB
           method_descriptor |         681 |     21.28 KB
                     weakref |         434 |     16.95 KB
                       tuple |         476 |     15.76 KB
                         set |         122 |     15.34 KB
         <class 'abc.ABCMeta |          18 |      7.88 KB
         function (__init__) |         130 |      7.11 KB
           member_descriptor |         226 |      7.06 KB
           getset_descriptor |         213 |      6.66 KB
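
For reference, I gathered this kind of summary with Pympler's muppy along these lines (roughly the usual pattern; my actual profiling hook differs slightly):

from pympler import muppy, summary

# Called once per iteration/file inside the processing loop
all_objects = muppy.get_objects()
summary.print_(summary.summarize(all_objects))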

I have 4 GB of RAM in my laptop and I am processing a huge number of small (< 1 MB) XML files. I am looking for a better way to merge the smaller dictionaries.

Rahul

2 Answers


If you use Python 3.3 you could check whether collections.ChainMap would be a solution for you. I have not used it yet, but it is supposed to be a quick way to link multiple dictionaries together without copying them. See the discussion here.
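
A minimal sketch of what that might look like (Python 3.3+, untested since I have not used it myself), using the dword_list from your question:

from collections import ChainMap

# Links the dicts without copying them; lookups search the maps left to right
combined = ChainMap(*dword_list)

# If a single flat dict is still needed, it can be materialised afterwards
final_dword = dict(combined)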

Maybe try to pickle dword_list to a file, and use a generator instead of keeping the list in memory. That way you stream the data instead of storing it. It should free some memory and make the program faster. Something like:

import pickle

def xml_dict():
    with open("path/to/file.pickle", "rb") as f:  # pickle.load needs a file object, not a path
        for d in pickle.load(f):
            yield d
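
Consuming the generator could then look roughly like this, after dword_list has been dumped to that pickle file, in place of map(final_dword.update, dword_list):

final_dword = {}
for d in xml_dict():  # stream the dicts one at a time instead of holding the whole list
    final_dword.update(d)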
Roman
  • Well, I am using python 2.X :( – Rahul Sep 21 '14 at 22:19
  • Maybe try a generator instead of storing the whole list. See the edited post. – Roman Sep 22 '14 at 08:09
  • Is there any way I can merge the dictionaries while they are being output by the processes (n = 2, in this case)? That would save memory and I wouldn't have to pickle it. – Rahul Sep 22 '14 at 21:03
  • Did you have a look at the collections module? [defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict) could be useful; it is supposed to be high performance. Or maybe [collections.Counter](https://docs.python.org/2/library/collections.html#counter-objects). – Roman Sep 23 '14 at 23:22
  • It still takes quite a while to update the dictionary, and usually I run out of memory by then. I need to merge the elements (dicts) of the list as and when the parallel processes output them; that would at least in theory be the fastest way of merging. But I am not sure how to do that. – Rahul Sep 24 '14 at 00:51
  • Okay, that is a problem. You need a shared dictionary and have parse_xml add items to it. I have never done this, but I just came across multiprocessing.Manager, maybe that is useful. The idea is to create a shared dictionary (rough sketch below). [Here is an example](http://pymotw.com/2/multiprocessing/communication.html#managing-shared-state) – Roman Sep 25 '14 at 00:42
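
To illustrate the Manager idea from the last comment, a rough, untested sketch. It assumes parse_xml and locate from the question are available at module level; parse_xml_into is a hypothetical wrapper:

import multiprocessing as mp

def parse_xml_into(args):
    # Hypothetical wrapper: parse one file and merge its dict straight into
    # the shared dict, mirroring the original map(final_dword.update, ...).
    filename, shared = args
    shared.update(parse_xml(filename))

if __name__ == '__main__':
    manager = mp.Manager()
    final_dword = manager.dict()   # proxy dict shared between worker processes
    pool = mp.Pool(processes=2)
    pool.map(parse_xml_into, [(f, final_dword) for f in locate("*.xml")])
    pool.close()
    pool.join()
    print dict(final_dword)        # copy the proxy back into a plain dict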

Using itertools you can chain containers:

import itertools

setA = {1, 2, 3}
setB = {4, 5, 6}
setC = {7, 8, 9}

# chain() iterates over each container in turn without building a new one
for key in itertools.chain(setA, setB, setC):
    print key,

Output: 1 2 3 4 5 6 7 8 9

This way you don't need to create a new container; chain simply runs over the iterables until they are exhausted. It is the same approach that user @roippi suggested in a comment, just written differently:

dict(itertools.chain.from_iterable(x.iteritems() for x in dword_list))
user1767754