I'm running into a very peculiar issue with using multiprocessing pools in Python 3... See the code below:
import multiprocessing as MP

class c(object):
    def __init__(self):
        self.foo = ""

    def a(self, b):
        return b

    def main(self):
        with open("/path/to/2million/lines/file", "r") as f:
            self.foo = f.readlines()

o = c()
o.main()
p = MP.Pool(5)
for r in p.imap(o.a, range(1, 10)):
    print(r)
If I execute this code as is, the result is extremely slow:
1
2
3
4
5
6
7
8
9
real 0m6.641s
user 0m7.256s
sys 0m1.824s
However, if I remove the line o.main(), execution is much faster:
1
2
3
4
5
6
7
8
9
real 0m0.155s
user 0m0.048s
sys 0m0.004s
My environment has plenty of power, and I've made sure I'm not hitting any memory limits. I also tested with a smaller file, and the execution time was much more acceptable. Any insight?
EDIT: I removed the disk IO part and just built a list in memory instead. This shows the disk IO has nothing to do with the problem...
for i in range(1, 500000):
    self.foo.append("foobar%d\n" % i)

real 0m1.763s
user 0m1.944s
sys 0m0.452s
for i in range(1, 1000000):
    self.foo.append("foobar%d\n" % i)

real 0m3.808s
user 0m4.064s
sys 0m1.016s