2

I am running a time-consuming program many times over. I have access to a cluster where I can request 504 processors, but its support service is, let's say, slow, so I turn to you, SO. I am using a very simple application as follows:

import multiprocessing

def function(data):
    data = complicated_function_I_was_given(data)
    # 'unique_id' stands in for an output filename unique to each work item
    with open('unique_id', 'w') as f:
        f.write(data)

pool = multiprocessing.Pool(504)
pool.map(function, data_iterator)

Now, although I can see the processes start (complicated_function_I_was_given writes a bunch of scratch files, but with unique names, so I am sure there is no clash), the whole run seems really slow. I expect some items in data_iterator to be processed almost immediately, while others will take days, yet after one day nothing has been produced. Could it be that multiprocessing.Pool() has a limit? Or that it doesn't distribute the processes over different nodes (I know each node has 12 cores)? I am using Python 2.6.5.
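A small check along these lines (the whoami helper is just for this test, not part of my real program) would at least show how many worker processes the pool actually starts:

import multiprocessing

def whoami(x):
    # each worker reports its own process name
    return multiprocessing.current_process().name

if __name__ == '__main__':
    pool = multiprocessing.Pool(504)
    names = set(pool.map(whoami, range(10000)))
    print len(names)  # number of distinct workers that picked up work
    pool.close()
    pool.join()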

Zenon

2 Answers

4

Or that it doesn't distribute the processes over different nodes (I know each node has 12 cores)? I am using Python 2.6.5.

I think this is your problem: unless your cluster architecture is very unusual and all the processors appear to be on the same logical machine, multiprocessing will only have access to the local cores. You probably need to use a different parallelisation library.

See also the answers to this question.
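For what it's worth, here is a minimal sketch of the mpi4py route (not tested on your cluster; data_iterator, complicated_function_I_was_given and the output names are placeholders taken from your question):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id: 0 .. size-1
size = comm.Get_size()   # total number of processes across all nodes

data_items = list(data_iterator)   # assumes every rank can build the same list

# each rank handles every size-th item, so the work is spread over the nodes
for i in range(rank, len(data_items), size):
    result = complicated_function_I_was_given(data_items[i])
    with open('output_%d' % i, 'w') as f:
        f.write(result)

You would launch it with something like mpirun -n 504 python script.py, with the exact invocation depending on your cluster's scheduler.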

James
  • Thanks for the link, I think you are right. I don't know how I could have missed that question! Now to play with mpi4py, then. – Zenon Feb 27 '12 at 00:09
1

You might try scaling the work with one of Python's many parallel libraries; I have not heard of work being spread over that many processors with multiprocessing alone.
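If you do stay with multiprocessing, one thing to try on a single node is sizing the pool to that node's cores and streaming results as they finish. A rough sketch (the output naming here is my own illustration, not from the question):

import multiprocessing

def function(data):
    return complicated_function_I_was_given(data)   # placeholder from the question

if __name__ == '__main__':
    # a Pool can only use the cores of the machine it runs on (about 12 per node here)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for i, result in enumerate(pool.imap_unordered(function, data_iterator)):
        # results arrive as soon as each item finishes, so quick items show up early
        with open('output_%d' % i, 'w') as f:
            f.write(result)
    pool.close()
    pool.join()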

bluemoon