
I'm running into a very peculiar issue when using multiprocessing pools in Python 3. See the code below:

import multiprocessing as MP

class c(object):
    def __init__(self):
        self.foo = ""

    def a(self, b):
        return b

    def main(self):
        with open("/path/to/2million/lines/file", "r") as f:
            self.foo = f.readlines()

o = c()
o.main()
p = MP.Pool(5)
for r in p.imap(o.a, range(1,10)):
    print(r)

If I execute this code as-is, the result is extremely slow:

1
2
3
4
5
6
7
8
9

real    0m6.641s
user    0m7.256s
sys     0m1.824s                    

However, if I remove the line o.main(), I get a much faster execution time:

1
2
3
4
5
6
7
8
9

real    0m0.155s
user    0m0.048s
sys     0m0.004s

My environment has plenty of power, and I've made sure I'm not running into any memory limits. I also tested it with a smaller file, and execution time is much more acceptable. Any insight?

EDIT: I removed the disk IO part and just built a list in memory instead, which shows the disk IO has nothing to do with the problem...

for i in range(1,500000):
    self.foo.append("foobar%d\n"%i)

real    0m1.763s
user    0m1.944s
sys     0m0.452s

for i in range(1,1000000):
    self.foo.append("foobar%d\n"%i)

real    0m3.808s
user    0m4.064s
sys     0m1.016s
f-z
  • How long does the `o.main()` take on its own? (Without the following MP code.) – viraptor Jul 18 '17 at 00:06
  • `real 0m0.182s user 0m0.112s sys 0m0.068s` The file size is actually only 27M. – f-z Jul 18 '17 at 00:17
  • Can you try it with [`ThreadPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) and/or [`ThreadPool`](https://stackoverflow.com/a/3386632/1189040) to see if it has something to do with the process overhead? – Himal Jul 18 '17 at 01:46

1 Answer


Under the hood, multiprocessing.Pool uses a Pipe to transfer the data from the parent process to the Pool workers.

This adds a hidden cost to the scheduling of tasks: because `o.a` is a bound method, the entire `o` object gets serialised with pickle and transferred over an OS pipe.

This is done for each and every task you schedule (9 times in your example, once per element of `range(1, 10)`). If the pickled object is 10 MB in size, you are shifting roughly 90 MB of data.
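You can check this yourself with the same pickler the Pool uses internally; a minimal sketch (the `small`/`big` instances and the sizes in the comments are rough, illustrative figures, not from your code):

import pickle
from multiprocessing.reduction import ForkingPickler

class c(object):
    def __init__(self):
        self.foo = ""

    def a(self, b):
        return b

small = c()
big = c()
big.foo = ["foobar%d\n" % i for i in range(1, 1000000)]

# Pickling a bound method also pickles the instance it is bound to,
# so big.a drags the entire million-entry list along with it.
print(len(ForkingPickler.dumps(small.a)))  # on the order of a hundred bytes
print(len(ForkingPickler.dumps(big.a)))    # well over ten megabytes

With `range(1, 10)` that large payload crosses the pipe nine times, which lines up with the slowdown you are seeing.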

According to the multiprocessing Programming Guidelines:

As far as possible one should try to avoid shifting large amounts of data between processes.

A simple way to speed up your logic would be to count the lines in your file, split them into equal chunks, send only the line indexes to the worker processes, and let each worker open the file, seek to its own lines, and process the data. A rough sketch of that idea follows.
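A minimal sketch under those assumptions; `FILENAME`, `count_lines` and `process_chunk` are illustrative names standing in for your real file and per-line work:

import multiprocessing as MP

FILENAME = "/path/to/2million/lines/file"  # the same file as in the question

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def process_chunk(bounds):
    # Each worker opens the file itself and reads only its slice of lines,
    # so no large data is ever pickled and sent over the pipe.
    start, stop = bounds
    with open(FILENAME) as f:
        lines = [line for i, line in enumerate(f) if start <= i < stop]
    return len(lines)  # placeholder for the real per-line work

if __name__ == "__main__":
    total = count_lines(FILENAME)
    workers = 5
    step = (total + workers - 1) // workers
    chunks = [(i, min(i + step, total)) for i in range(0, total, step)]

    with MP.Pool(workers) as pool:
        for result in pool.imap(process_chunk, chunks):
            print(result)

Only the small (start, stop) tuples cross the pipe; the big list never leaves the process that needs it.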

noxdafox