
I'm trying to learn how to use Python's multiprocessing package, but I don't understand the difference between map and imap.

Is the difference that map returns, say, an actual array or set, while imap returns an iterator over an array or set? When would I use one over the other?

Also, I don't understand what the chunksize argument is. Is this the number of values that are passed to each process?

grautur
    Closely related: [multiprocessing.pool: What's the difference between map_async and imap?](http://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap/26521507#26521507) – dano Jan 28 '16 at 19:34

4 Answers


That is the difference. One reason why you might use imap instead of map is if you wanted to start processing the first few results without waiting for the rest to be calculated. map waits for every result before returning.

As for chunksize, it is sometimes more efficient to dole out work in larger quantities because every time the worker requests more work, there is IPC and synchronization overhead.
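
A minimal sketch of both points, with a made-up `slow_square` worker standing in for real work:

```python
from multiprocessing import Pool
import time

def slow_square(x):
    time.sleep(0.1)   # simulate real per-item work
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # imap yields each result as soon as it is ready, so the
        # first squares can be consumed while later ones are still
        # being computed.  chunksize=5 ships 5 inputs per IPC
        # round-trip instead of 1, reducing synchronization overhead.
        for square in pool.imap(slow_square, range(20), chunksize=5):
            print(square)
```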

Antimony
    So how does one approach determining a reasonable value for chunksize then? If bigger means less IPC & sync overhead due to pickling, what's the tradeoff? (ie why is picking `chunksize == len(iterable)` a bad idea, or is it?) – Adam Parkin Aug 24 '12 at 17:54
    @Adam If you pick `chunksize = len(iterable)`, then all the jobs will be assigned to a single process! `len(iterable) // numprocesses` is the maximum that is useful. The tradeoff is between synchronization overhead and cpu utilization (large chunksizes will cause some processes to finish before others, wasting potential processing time). – Antimony Oct 03 '12 at 18:27
  • Ok, I see that, but doesn't that simply mean picking a reasonable chunksize boils down to trial and error on particular data in a particular setting? – Adam Parkin Oct 03 '12 at 18:29
  • I think so. Most optimization requires profiling and fine tuning. – Antimony Oct 03 '12 at 22:34
    It's also worth mentioning that imap can be applied to a generator, while map will turn your generator into a list-like object, so imap doesn't wait for the input to get generated. – mgoldwasser Dec 08 '15 at 21:07
  • Does that mean that `imap` returns the first result from the x number of processes running? i.e. does it retain order? Does that mean that you still have to wait for the first process to finish before you can start the iteration on the results? – Tjorriemorrie Jan 13 '16 at 13:31
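
A quick sketch of the point about generators (the `double` worker and `numbers` generator are made up for illustration): `Pool.imap` pulls values from the generator lazily, instead of materializing it up front the way `map` does, and it yields results back in input order:

```python
from multiprocessing import Pool

def double(x):
    return 2 * x

def numbers():
    # a generator; imap pulls values from it as workers become free
    for i in range(10):
        yield i

if __name__ == '__main__':
    with Pool(2) as pool:
        # results come back in input order, one at a time
        print(list(pool.imap(double, numbers())))
```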

imap is from the itertools module, which exists for fast, memory-efficient iteration in Python. map returns a list, whereas imap returns an object that generates the values on each iteration (in Python 2.7). The code blocks below make the difference clear.

map returns a list, which can be printed directly:

from itertools import *
from math import *

integers = [1, 2, 3, 4, 5]
sqr_ints = map(sqrt, integers)
print(sqr_ints)

imap returns an object, which is converted to a list and then printed:

from itertools import *
from math import *

integers = [1,2,3,4,5]
sqr_ints = imap(sqrt, integers)
print list(sqr_ints)
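
Note that `itertools.imap` was removed in Python 3, where the built-in `map` is itself lazy and behaves much like `imap` did; a quick Python 3 sketch:

```python
from math import sqrt

integers = [1, 2, 3, 4, 5]
roots = map(sqrt, integers)    # in Python 3, map is lazy, like imap was
print(next(roots))             # 1.0 -- values are produced on demand
print(list(roots))             # the remaining square roots
```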

chunksize makes the iterable be split into pieces of (approximately) the specified size, and each piece is submitted to a worker as a separate task.
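
The splitting behaves roughly like this pure-Python sketch (a simplified model, not the pool's actual internals):

```python
def chunked(iterable, chunksize):
    """Yield successive chunks of up to `chunksize` items."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk   # the last chunk may be smaller

print(list(chunked(range(7), 3)))   # [[0, 1, 2], [3, 4, 5], [6]]
```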

Chandan
    While the spirit of this answer is correct, notice that the original question is asking about **multiprocessing** map and imap. – Jacob Jun 27 '20 at 22:54

With imap, the forked calls run in parallel rather than one after another. For example, suppose you are hitting three exchanges to fetch order books. Instead of hitting exchange 1, then exchange 2, then exchange 3 sequentially, pool.imap calls are non-blocking and go out to all three exchanges to fetch order books as soon as you call.

from pathos.multiprocessing import ProcessingPool as Pool
self.pool = Pool().imap
self.pool(self.getOrderBook, Exchanges, Tickers)
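
A self-contained version of the same idea using the standard library's `multiprocessing.Pool` (the `fetch_order_book` function is a made-up stand-in for a real network call):

```python
from multiprocessing import Pool
import time

def fetch_order_book(exchange):
    time.sleep(0.1)   # stand-in for a network request
    return f'{exchange}: order book'

if __name__ == '__main__':
    exchanges = ['exchange1', 'exchange2', 'exchange3']
    with Pool(3) as pool:
        # all three fetches are in flight at once; results arrive
        # in input order as each one completes
        for book in pool.imap(fetch_order_book, exchanges):
            print(book)
```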

One key difference is when and how your worker function returns its result. If you use your worker for its side effects (creating files etc.) and don't expect it to return anything then this does not apply to you.

from multiprocessing import Pool
import time

start = time.time()

def get_time():
    return int(time.time() - start)

def worker(args):
    name, delay = args
    print(f'{get_time()}: Job {name} started ({delay} seconds)')
    time.sleep(delay)
    return f'Job {name} done'

jobs = [
    ('A', 1),
    ('B', 2),
    ('C', 10),
    ('D', 3),
    ('E', 4),
    ('F', 5),
]

if __name__ == '__main__':
    with Pool(2) as pool:
        for result in pool.map(worker, jobs):
            print(f'{get_time()}: {result}')

If you use map, the code generates this output:

 0: Job A started (1 seconds)
 0: Job B started (2 seconds)
 1: Job C started (10 seconds)
 2: Job D started (3 seconds)
 5: Job E started (4 seconds)
 9: Job F started (5 seconds)
14: Job A done
14: Job B done
14: Job C done
14: Job D done
14: Job E done
14: Job F done

As you can see, all results are returned in bulk, in input order, at the 14th second, regardless of when the jobs actually finished.

If you change the method to imap, the code then generates this output:

 0: Job A started (1 seconds)
 0: Job B started (2 seconds)
 1: Job C started (10 seconds)
 1: Job A done
 2: Job D started (3 seconds)
 2: Job B done
 5: Job E started (4 seconds)
 9: Job F started (5 seconds)
11: Job C done
11: Job D done
11: Job E done
14: Job F done

Now the full code again finishes at the 14th second, but some results (A, B) are returned earlier, when those jobs actually finished. This method still keeps the input order, so even though (as you can calculate) jobs D and E finished in the 5th and 9th second, they could not be returned earlier: they still had to wait for the long job C until the 11th second.

If you change the method to imap_unordered, the code then generates this output:

 0: Job A started (1 seconds)
 0: Job B started (2 seconds)
 1: Job C started (10 seconds)
 1: Job A done
 2: Job D started (3 seconds)
 2: Job B done
 5: Job E started (4 seconds)
 5: Job D done
 9: Job F started (5 seconds)
 9: Job E done
11: Job C done
14: Job F done

Now all jobs are returned immediately when they finish. The input order is not preserved.

Jeyekomon