I have written some Python code that extracts information from a large database, performs maximum-likelihood modeling of each item, and then stores the results. The code works fine serially, and each item takes 1-2 seconds, but I have a few million items in the database, so a serial run would take weeks.

I have access to a cluster with 16 CPUs, and I've written a function that breaks the initial input catalog into "chunks," runs the modeling code on each chunk, and then organizes the results. Because of the complexity of the modeling portion of the code and its reliance on third-party Python packages, I've been trying to parallelize the code with multiprocessing so that each chunk of the input catalog runs on a separate CPU. Currently, however, the processes spawned by my attempt all run on a single CPU. Does anyone know how to fix this?

I am a self-taught scientific programmer, so I have some experience with Python, but this is the first time I've attempted parallelization. The code is far too long to show here in its entirety, but it essentially does the following:
import multiprocessing as mp

def split(a, n):
    # split list a into n roughly equal chunks (Python 2, so / here is integer division)
    k, m = len(a) / n, len(a) % n
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in xrange(n))

ncpus = mp.cpu_count()
chunks = list(split(catlist, ncpus))  # catlist is the input catalog (defined elsewhere in the full code)
images = [im1, im2, im3, im4, im5, im6, im7, im8, im9, im10, im11, im12]

def process_chunks(chunk):
    database_chunk = database[chunk[:]]
    for i in range(0, len(database_chunk)):
        # perform some analysis
        for j in range(0, len(images)):
            # perform more analysis
    my_dict = {'out1': out1, 'out2': out2, 'out3': out3}
    return my_dict

pool = mp.Pool(processes=ncpus)
result = [pool.apply(process_chunks, args=(chunks[id],)) for id in range(0, len(chunks))]
For a simple test example of only 13 items from the database, this is what chunks looks like (a list of roughly evenly split lists):
chunks = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
The code runs fine, and the pool.apply command spawns multiple processes, but all of the processes run on a single CPU and there is no improvement in runtime over the serial version (in fact it's a bit slower, I assume because of the overhead of spawning each process). Is there a way to force each process to be assigned to a separate CPU, so that I see some speedup when I apply this to the full database (which has millions of items)? Or is there a better way to go about this problem in general?
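In case it helps to have something runnable, here is a stripped-down, self-contained sketch of the same pattern, with time.sleep standing in as a toy placeholder for the real per-item modeling and a plain list of indices standing in for the catalog (the names and the one-second sleep are just placeholders, not my real code):

import multiprocessing as mp
import time

def split(a, n):
    # same chunking helper as above
    k, m = len(a) // n, len(a) % n
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def process_chunks(chunk):
    # toy stand-in for the real modeling: roughly one second per item
    for item in chunk:
        time.sleep(1)
    return {'chunk': chunk, 'nitems': len(chunk)}

if __name__ == '__main__':
    catlist = list(range(13))            # stand-in for the real catalog
    ncpus = mp.cpu_count()
    chunks = list(split(catlist, ncpus))
    pool = mp.Pool(processes=ncpus)
    result = [pool.apply(process_chunks, args=(c,)) for c in chunks]
    print(result)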
Thank you so much in advance!
Note: In case it clarifies the situation further, I'm an astronomer and the database is a large source catalog with over a million entries, the images are wide-field mosaics in 12 different bands, and I'm performing maximum-likelihood photometry on each band of each source.