
I have an array (called data_inputs) containing the names of hundreds of astronomy image files. These images are then manipulated. My code works and takes a few seconds to process each image, but it can only do one image at a time because I'm running the array through a for loop:

for name in data_inputs:
    sci = fits.open(name + '.fits')
    # image is manipulated

There is no reason why I have to modify an image before any other, so is it possible to utilise all 4 cores on my machine with each core running through the for loop on a different image?

I've read about the multiprocessing module but I'm unsure how to implement it in my case. I'm keen to get multiprocessing to work because eventually I'll have to run this on 10,000+ images.


4 Answers


You can simply use multiprocessing.Pool:

from multiprocessing import Pool

def process_image(name):
    sci = fits.open('{}.fits'.format(name))
    # image is manipulated here

if __name__ == '__main__':
    pool = Pool()                         # Create a multiprocessing Pool
    pool.map(process_image, data_inputs)  # process data_inputs iterable with pool
  • It might be better to use `pool = Pool(os.cpu_count())`. This is a more generic way of using multiprocessing. – Lior Magen Apr 13 '16 at 12:26
  • Note: `os.cpu_count()` was added in Python 3.4. For Python 2.x, use `multiprocessing.cpu_count()`. – dwj Sep 14 '16 at 17:12
  • `Pool()` is the same as `Pool(os.cpu_count())` – Tim Sep 19 '16 at 15:43
  • To elaborate on @Tim's comment: `Pool()` called without a value for `processes` is the same as `Pool(processes=cpu_count())` regardless of whether you are using Python 3 or 2, so the best practice in EITHER version is to use `Pool()`. https://docs.python.org/2/library/multiprocessing.html – Kyle Pittman Nov 02 '16 at 18:22
  • @Alko: I am unable to understand what "data_inputs" is here. You haven't defined it. What value should I give it? – Abhishek dot py Apr 27 '17 at 14:18
  • @LiorMagen, if I'm not mistaken, using Pool(os.cpu_count()) will make the OS freeze until the processing is over, as you don't leave the OS any free cores. For a lot of users Pool(os.cpu_count() - 1) might be a better choice. – shayelk Oct 09 '17 at 12:16
  • @shayelk You are absolutely right and I always leave at least one CPU alone. I gave it just as an example for using multiprocessing. – Lior Magen Oct 09 '17 at 14:03
  • There must be pool.close() at the end. – Spas Jul 29 '18 at 15:57
  • @Abhishekdotpy It is the same iterable used by the OP. The OP was looping `for name in data_inputs:` – timctran Apr 08 '20 at 17:58
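Pulling the answer and the comments together, a minimal runnable sketch of the Pool pattern might look like the following. Note that `process_image` here uses a trivial stand-in for the actual image manipulation (the real version would call `fits.open`, which needs astropy and real files on disk):

```python
from multiprocessing import Pool

def process_image(name):
    # Stand-in for the real work, e.g. fits.open(name + '.fits') plus manipulation
    return name.upper()

if __name__ == '__main__':
    data_inputs = ['img_a', 'img_b', 'img_c']  # hypothetical file name stems
    # Pool() with no argument uses os.cpu_count() worker processes;
    # the with-block terminates the pool on exit, so no explicit close() is needed.
    with Pool() as pool:
        results = pool.map(process_image, data_inputs)
    print(results)  # ['IMG_A', 'IMG_B', 'IMG_C']
```

The `if __name__ == '__main__':` guard matters on platforms that spawn rather than fork worker processes (e.g. Windows), where the module is re-imported in each worker.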

You can use multiprocessing.Pool:

from multiprocessing import Pool
class Engine(object):
    def __init__(self, parameters):
        self.parameters = parameters
    def __call__(self, filename):
        sci = fits.open(filename + '.fits')
        manipulated = manipulate_image(sci, self.parameters)
        return manipulated

pool = Pool(8) # on 8 processors
engine = Engine(my_parameters)
try:
    data_outputs = pool.map(engine, data_inputs)
finally: # to make sure processes are closed in the end, even if errors happen
    pool.close()
    pool.join()
  • I am unable to understand what "data_inputs" is here. You haven't defined it. What value should I give it? – Abhishek dot py Apr 27 '17 at 14:17
  • It actually stems from alko's answer; I'm citing his comment (see the code block): "process data_inputs iterable with pool". So `data_inputs` is an iterable (like in a standard `map`). – ponadto Nov 10 '17 at 06:52
  • The [python doc](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) only shows that one can pass a function to `pool.map(func, iterable[, chunksize])`. When passing an object, will this object be shared by all the processes? Thus, could I have all processes write to the same list `self.list_` in the object? – Philipp Oct 09 '20 at 08:34
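On the sharing question above: the callable object is pickled and sent to each worker process, so every worker mutates its own copy; writes to an attribute like `self.list_` are not visible in the parent. A small sketch illustrating this (with hypothetical names, not taken from the answer):

```python
from multiprocessing import Pool

class Engine:
    def __init__(self):
        self.seen = []            # looks shared, but is copied per worker

    def __call__(self, x):
        self.seen.append(x)       # mutates the worker's *copy* only
        return x * 2

if __name__ == '__main__':
    engine = Engine()
    with Pool(2) as pool:
        results = pool.map(engine, [1, 2, 3])
    print(results)      # [2, 4, 6]
    print(engine.seen)  # [] -- the parent's copy was never touched
```

To collect per-item results, rely on the return values of `map` (as the answer does with `data_outputs`) rather than mutating state inside the workers.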

Alternatively

from multiprocessing import Pool

with Pool() as pool:
    pool.map(fits.open, [name + '.fits' for name in data_inputs])

I would suggest using imap_unordered with a chunksize if you are only iterating over the results in a for loop. It yields each result as soon as it is computed, whereas map waits for all results to be computed and hence blocks.
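A sketch of that pattern, with a hypothetical worker function standing in for the image processing (the chunksize of 10 is an arbitrary illustrative choice):

```python
from multiprocessing import Pool

def process_image(name):
    # placeholder for the real per-image work
    return name + '.done'

if __name__ == '__main__':
    data_inputs = ['img%d' % i for i in range(100)]
    with Pool() as pool:
        # Each result is yielded as soon as its chunk finishes,
        # in whatever order the workers complete them.
        for result in pool.imap_unordered(process_image, data_inputs, chunksize=10):
            print(result)
```

Because the order is arbitrary, have the worker return enough information (e.g. the input name) to identify which item each result belongs to.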
