While trying to contribute some optimization to the parallelization in the pystruct module, and while explaining in discussions why I wanted to instantiate pools as early in the execution as possible, keep them around as long as possible, and reuse them, I realized that I know this works best in practice, but I don't completely know why.
I know the claim, on *nix systems, is that a pool worker subprocess shares the parent's globals copy-on-write. That is definitely true on the whole, but I think a caveat is needed: when one of those globals is a particularly dense data structure like a numpy or scipy matrix, whatever does get copied down into the worker appears to be quite sizeable even though the whole object is not copied, so spawning new pools late in the execution can cause memory issues. The best practice I have found is to spawn a pool as early as possible, so that the global namespace is still small when the workers fork (something like the pattern sketched just below).
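As a minimal sketch of what I mean (the worker function and the matrix size here are just placeholders for illustration): the pool is forked while the parent is still small, the large data is built afterwards, and rows are handed to the workers explicitly through map rather than implicitly through inherited globals.

import numpy as np
import multiprocessing as mp

def row_sum(row):
    # work done in the worker; the row arrives via pickling, not via a global
    return row.sum()

if __name__ == '__main__':
    # fork the workers while the parent process is still small
    pool = mp.Pool()

    # build the big data *after* the pool exists
    biggish_matrix = np.ones((1000, 500))

    # hand rows to the workers explicitly
    row_sums = pool.map(row_sum, biggish_matrix)

    pool.close()
    pool.join()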
I have known this for a while and engineered around it in applications at work, but the best explanation I've gotten is what I posted in the thread here:
https://github.com/pystruct/pystruct/pull/129#issuecomment-68898032
Looking at the python script below: essentially, you would expect free memory at the 'pool created' step in the first run and at the 'matrix created' step in the second run to be basically equal, and likewise for both final 'pool terminated' measurements. But they never are; there is always (unless something else is going on on the machine, of course) more free memory when you create the pool first. That effect seems to increase with the complexity (and size) of the data structures in the global namespace at the time the pool is created. Does anyone have a good explanation for this?
I made this little picture with the bash loop and the R script (both also below) to illustrate, showing overall free memory after both the pool and the matrix have been created, depending on the order:
pool_memory_test.py:
import numpy as np
import multiprocessing as mp
import logging

def memory():
    """
    Get node total memory and memory usage
    """
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        tmp = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemTotal:':
                ret['total'] = int(sline[1])
            elif str(sline[0]) in ('MemFree:', 'Buffers:', 'Cached:'):
                tmp += int(sline[1])
        ret['free'] = tmp
        ret['used'] = int(ret['total']) - int(ret['free'])
    return ret

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pool_first', action='store_true')
    parser.add_argument('--call_map', action='store_true')
    args = parser.parse_args()

    if args.pool_first:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        print(memory()['free'])
    else:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        print(memory()['free'])

    if args.call_map:
        row_sums = p.map(sum, biggish_matrix)
        logging.debug('sum mapped:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))

    p.terminate()
    p.join()
    logging.debug('pool terminated:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                  for k,v in memory().items()])))
pool_memory_test.sh:
#! /bin/bash
rm pool_first_obs.txt > /dev/null 2>&1;
rm matrix_first_obs.txt > /dev/null 2>&1;
for ((n=0;n<100;n++)); do
python pool_memory_test.py --pool_first >> pool_first_obs.txt;
python pool_memory_test.py >> matrix_first_obs.txt;
done
pool_memory_test_plot.R:
library(ggplot2)
library(reshape2)
pool_first = as.numeric(readLines('pool_first_obs.txt'))
matrix_first = as.numeric(readLines('matrix_first_obs.txt'))
df = data.frame(i=seq(1,100), pool_first, matrix_first)
plt = ggplot(data=melt(df, id.vars='i'), aes(x=i, y=value, color=variable)) +
    geom_point() + geom_smooth() + xlab('iteration') + ylab('free memory')
ggsave('multiprocessing_pool_memory.png', plot=plt)
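For what it's worth, one way to poke at this further (not part of the scripts above, just a rough sketch assuming Linux /proc semantics) would be to compare the resident set size of the parent and of each pool worker right after the pool is created in each ordering. Note that Pool._pool is an internal attribute, used here only because it is a convenient way to get at the worker pids:

import os
import numpy as np
import multiprocessing as mp

def rss_kb(pid):
    """Read VmRSS (resident set size, kB) for a pid from /proc/<pid>/status."""
    with open('/proc/{}/status'.format(pid)) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

if __name__ == '__main__':
    # matrix-first ordering; swap the next two lines to test pool-first
    biggish_matrix = np.ones((50000, 5000))
    p = mp.Pool(4)

    print('parent RSS (kB): {}'.format(rss_kb(os.getpid())))
    for proc in p._pool:  # _pool is an implementation detail of Pool
        print('worker {} RSS (kB): {}'.format(proc.pid, rss_kb(proc.pid)))

    p.terminate()
    p.join()

VmRSS counts shared pages as well as private ones, so it will not by itself separate copy-on-write sharing from real copies; Private_Dirty in /proc/<pid>/smaps is closer to what actually got duplicated, but even the rough VmRSS numbers make it easy to compare the two orderings.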
EDIT: fixed a small bug in the script caused by an overzealous find/replace, and reran
EDIT2: "-0" slicing? You can do that? :)
EDIT3: better python script, bash looping and visualization, ok done with this rabbit hole for now :)