
I am trying to exploit parallelization in parsing data with Python 2.7 using the multiprocessing library. The task at hand is reading lots of large data files and returning their content as a dictionary or list. The problem is that when I try to return from my function, sending back the parsed data, my machine (running Ubuntu Linux) hangs, with the memory and load indicators at a maximum.

The code does something like this:

    import multiprocessing as mp

    def worker(filex):
        """ reading lots of data """

        raw = filter(lambda x: len(x.split()) > 3,
                     open(filex).readlines())
        data = {}
        ...  # putting all the data in the data dictionary
        return data

    # multiprocessing options
    nproc = mp.cpu_count()
    pool = mp.Pool(processes=nproc)
    traj = pool.map(worker, tuple(files_to_parse))
    pool.close()

The large data structure is what creates the problem. Interestingly, if I return something else the code works, and it even works if I return the same data structure but empty. Using a list instead of a dictionary for data did not help.

Any suggestions?

daviddesancho
  • What data are you inputting there? Can you create a [self-contained complete example program](http://sscce.org/)? [This test program](https://gist.github.com/phihag/f96e4a38cf8702077a47) works fine for me. Does it work for you? – phihag Jul 25 '14 at 09:59
  • I have adapted your code in this [example](https://gist.github.com/daviddesancho/4815fa48ea49e7e49691) to do something a bit more data intensive. I am just reading a json file from [another stackoverflow question](http://stackoverflow.com/questions/9390368), putting its contents into a dictionary and returning that object. My system hangs as it did before. Again, if I return something else than the data object then the script will not crash my machine. – daviddesancho Jul 25 '14 at 10:58
  • Multiprocessing works best if the data communicated between the processes is small and there is not much communication needed. Otherwise the overhead of interprocess communication eats time and/or the need to serialize and deserialize huge data sets eats memory (and time). – BlackJack Jul 25 '14 at 23:12

1 Answer


Your machine does not hang; it is working at full capacity to finish your program, so everything is working as expected. Eventually, your computer should finish its task.

That being said, there are some things you can do:

  • Improve the processing speed. For CPU-intensive work, Python as a language, and Python's json module in particular, may not be the best fit. Do you really need the whole document parsed into Python objects?
  • Do not use multiprocessing. With a threaded application you can only use one core (at least if your Python interpreter employs a GIL), but you won't have the overhead of pickling and unpickling the Python objects to transport them from one process to another.
  • Why not do the actual processing right in the worker? Do you really need all of these giant data structures in the memory of your original process? (See the first sketch after this list.)
  • If that's not an option, consider using imap or even imap_unordered instead of map. If your main process can consume the data structures faster than they are read and parsed, the memory pressure should stay constant. (See the second sketch below.)
  • nice your Python processes (all of them, or just the pool-created ones) in order to let the rest of the system run with higher priority. That won't make your program faster (in fact, it will typically even slow it down), but it should make the other programs on your system (including your user interface) react faster. (The third sketch below shows one way to do this.)
  • Use faster or more CPUs and RAM.
  • Distribute the work over multiple machines. multiprocessing does not support that directly, but there are a number of Python projects that enable cluster processing.
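
A minimal sketch of the "process in the worker" idea, assuming a hypothetical per-file summary (here just a couple of counts) is all the parent actually needs; adapt the reduction to whatever your analysis requires:

    import multiprocessing as mp

    def worker(filex):
        """Parse one file, but return only a small summary."""
        counts = {}
        for line in open(filex):
            fields = line.split()
            if len(fields) > 3:
                counts[fields[0]] = counts.get(fields[0], 0) + 1
        # only this tiny tuple gets pickled and sent back to the parent
        return filex, len(counts), sum(counts.values())

    if __name__ == '__main__':
        files_to_parse = ['a.dat', 'b.dat']  # placeholder file names
        pool = mp.Pool(processes=mp.cpu_count())
        for name, nkeys, nlines in pool.map(worker, files_to_parse):
            print name, nkeys, nlines
        pool.close()
        pool.join()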
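
And a sketch of the imap_unordered variant, assuming the parent can handle each result as it arrives and then drop it (handle_result is a hypothetical stand-in for your own consumer):

    import multiprocessing as mp

    def worker(filex):
        """Parse one file into a dictionary, as in the question."""
        data = {}
        for line in open(filex):
            fields = line.split()
            if len(fields) > 3:
                data[fields[0]] = fields[1:]
        return data

    def handle_result(data):
        # hypothetical consumer: do something with one parsed file
        pass

    if __name__ == '__main__':
        files_to_parse = ['a.dat', 'b.dat']  # placeholder file names
        pool = mp.Pool(processes=mp.cpu_count())
        # results are yielded as soon as any worker finishes, so only
        # a few parsed dictionaries need to be alive at any one time
        for data in pool.imap_unordered(worker, files_to_parse):
            handle_result(data)  # consume, then let data be freed
        pool.close()
        pool.join()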
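
For the nice suggestion, one way (on Unix, and assuming it is acceptable to lower the workers' priority) is a pool initializer that calls os.nice:

    import multiprocessing as mp
    import os

    def lower_priority():
        # runs once in each pool worker; adds 19 to the niceness,
        # i.e. drops the worker to the lowest scheduling priority
        os.nice(19)

    pool = mp.Pool(processes=mp.cpu_count(), initializer=lower_priority)
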
phihag
  • I would say my machine hangs, yes, as it becomes completely unresponsive. In the end, the returned data will consume all of the memory. – daviddesancho Jul 25 '14 at 14:32
  • One [alternative](https://gist.github.com/daviddesancho/04efa74ca14741cb69a9) that I find to work in a modified version of my example consists in writing the results to temporary files and returning the paths to them (which of course costs some performance). – daviddesancho Jul 25 '14 at 14:38
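
A minimal sketch of that temporary-file approach, assuming the parsed data is JSON-serializable (file names are placeholders):

    import json
    import multiprocessing as mp
    import os
    import tempfile

    def worker(filex):
        """Parse one file, dump the result to a temporary file and
        return only its path; the big dictionary never crosses the
        process boundary."""
        data = {}
        for line in open(filex):
            fields = line.split()
            if len(fields) > 3:
                data[fields[0]] = fields[1:]
        fd, path = tempfile.mkstemp(suffix='.json')
        with os.fdopen(fd, 'w') as out:
            json.dump(data, out)
        return path

    if __name__ == '__main__':
        files_to_parse = ['a.dat', 'b.dat']  # placeholder file names
        pool = mp.Pool(processes=mp.cpu_count())
        paths = pool.map(worker, files_to_parse)
        pool.close()
        pool.join()
        for path in paths:
            with open(path) as fh:
                data = json.load(fh)  # load one result at a time
            # ... use data, then delete the temporary file
            os.remove(path)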