I'm needing to make many computational iterations, say 100 or so, with a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times where I would plan to set N equal at least to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status
shows a Cpus_allowed
of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool()
alternative as my function was rejected as not being pickleable.
Here is the modified Alkarouri script, multifork.py
:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np
np.seterr(all='raise')
def function(x):
print '... the parameter is ', x
arr = np.zeros(5000).reshape(-1,1)
for r in range(3):
for i in range(200):
arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
return { 'junk': arr, 'shape': arr.shape }
def run_in_separate_process(func, *args, **kwds):
numruns = 4
result = [ None ] * numruns
pread = [ None ] * numruns
pwrite = [ None ] * numruns
pid = [ None ] * numruns
for i in range(numruns):
pread[i], pwrite[i] = os.pipe()
pid[i] = os.fork()
if pid[i] > 0:
pass
else:
os.close(pread[i])
result[i] = func(*args, **kwds)
with os.fdopen(pwrite[i], 'wb') as f:
cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
os._exit(0)
#time.sleep(17)
while True:
for i in range(numruns):
os.close(pwrite[i])
with os.fdopen(pread[i], 'rb') as f:
stat, res = cPickle.load(f)
result[i] = res
#os.waitpid(pid[i], 0)
if not None in result: break
return result
def main():
print 'Running multifork.py.'
print run_in_separate_process( function, 3 )
if __name__ == "__main__":
main()
With multifork.py
, uncommenting os.waitpid(pid[i], 0)
has no effect. And neither does uncommenting time.sleep()
if 4 iterations are being run at once and if the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the While True loop is not itself a cause of any appreciable inefficiency (due to the processes all taking the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, with which I employed subprocess.Popen()
in lieu of os.fork()
. With that one I had to send the function to an auxiliary script, the script that defines the command that is the first argument of Popen()
, via a file. I did however use the same While True loop. And the results? They were the same as with the simpler scheme that I am presenting here--- almost exactly.