While trying to contribute some optimization to the parallelization in the pystruct module, and while explaining in discussions why I wanted to instantiate pools as early in the execution as possible, keep them around as long as possible, and reuse them, I realized that I know this works best in practice, but I don't completely know why.
I know the claim, on *nix systems, is that a pool worker subprocess shares the parent's globals copy-on-write. That is definitely true on the whole, but I think a caveat is needed: when one of those globals is a particularly dense data structure like a numpy or scipy matrix, whatever does get copied down into the worker appears to be quite sizeable even though the whole object is not copied, so spawning new pools late in the execution can cause memory issues. The best practice I have found is to spawn a pool as early as possible, so that the global namespace is still small when the workers fork (something like the pattern sketched just below).
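As a minimal sketch of what I mean (the worker function and the matrix size here are just placeholders for illustration): the pool is forked while the parent is still small, the large data is built afterwards, and rows are handed to the workers explicitly through map rather than implicitly through inherited globals.

import numpy as np
import multiprocessing as mp

def row_sum(row):
    # work done in the worker; the row arrives via pickling, not via a global
    return row.sum()

if __name__ == '__main__':
    # fork the workers while the parent process is still small
    pool = mp.Pool()

    # build the big data *after* the pool exists
    biggish_matrix = np.ones((1000, 500))

    # hand rows to the workers explicitly
    row_sums = pool.map(row_sum, biggish_matrix)

    pool.close()
    pool.join()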
I have known this for a while and engineered around it in applications at work, but the best explanation I've gotten is what I posted in the thread here:
https://github.com/pystruct/pystruct/pull/129#issuecomment-68898032
Looking at the python script below: essentially, you would expect free memory at the 'pool created' step in the first run and at the 'matrix created' step in the second run to be basically equal, and likewise for both final 'pool terminated' measurements. But they never are; there is always (unless something else is going on on the machine, of course) more free memory when you create the pool first. That effect seems to increase with the complexity (and size) of the data structures in the global namespace at the time the pool is created. Does anyone have a good explanation for this?
I made this little picture with the bash loop and the R script (both also below) to illustrate, showing overall free memory after both the pool and the matrix have been created, depending on the order:
pool_memory_test.py:
import numpy as np
import multiprocessing as mp
import logging

def memory():
    """
    Get node total memory and memory usage
    """
    with open('/proc/meminfo', 'r') as mem:
        ret = {}
        tmp = 0
        for i in mem:
            sline = i.split()
            if str(sline[0]) == 'MemTotal:':
                ret['total'] = int(sline[1])
            elif str(sline[0]) in ('MemFree:', 'Buffers:', 'Cached:'):
                tmp += int(sline[1])
        ret['free'] = tmp
        ret['used'] = int(ret['total']) - int(ret['free'])
    return ret

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pool_first', action='store_true')
    parser.add_argument('--call_map', action='store_true')
    args = parser.parse_args()

    if args.pool_first:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        print(memory()['free'])
    else:
        logging.debug('start:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        biggish_matrix = np.ones((50000,5000))
        logging.debug('matrix created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        p = mp.Pool()
        logging.debug('pool created:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))
        print(memory()['free'])

    if args.call_map:
        row_sums = p.map(sum, biggish_matrix)
        logging.debug('sum mapped:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                      for k,v in memory().items()])))

    p.terminate()
    p.join()
    logging.debug('pool terminated:\n\t {}\n'.format(' '.join(['{}: {}'.format(k,v)
                  for k,v in memory().items()])))
pool_memory_test.sh:
#! /bin/bash
rm pool_first_obs.txt > /dev/null 2>&1;
rm matrix_first_obs.txt > /dev/null 2>&1;
for ((n=0;n<100;n++)); do
python pool_memory_test.py --pool_first >> pool_first_obs.txt;
python pool_memory_test.py >> matrix_first_obs.txt;
done
pool_memory_test_plot.R:
library(ggplot2)
library(reshape2)
pool_first = as.numeric(readLines('pool_first_obs.txt'))
matrix_first = as.numeric(readLines('matrix_first_obs.txt'))
df = data.frame(i=seq(1,100), pool_first, matrix_first)
plt = ggplot(data=melt(df, id.vars='i'), aes(x=i, y=value, color=variable)) +
    geom_point() + geom_smooth() + xlab('iteration') + ylab('free memory')
ggsave('multiprocessing_pool_memory.png', plot=plt)
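For what it's worth, one way to poke at this further (not part of the scripts above, just a rough sketch assuming Linux /proc semantics) would be to compare the resident set size of the parent and of each pool worker right after the pool is created in each ordering. Note that Pool._pool is an internal attribute, used here only because it is a convenient way to get at the worker pids:

import os
import numpy as np
import multiprocessing as mp

def rss_kb(pid):
    """Read VmRSS (resident set size, kB) for a pid from /proc/<pid>/status."""
    with open('/proc/{}/status'.format(pid)) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

if __name__ == '__main__':
    # matrix-first ordering; swap the next two lines to test pool-first
    biggish_matrix = np.ones((50000, 5000))
    p = mp.Pool(4)

    print('parent RSS (kB): {}'.format(rss_kb(os.getpid())))
    for proc in p._pool:  # _pool is an implementation detail of Pool
        print('worker {} RSS (kB): {}'.format(proc.pid, rss_kb(proc.pid)))

    p.terminate()
    p.join()

VmRSS counts shared pages as well as private ones, so it will not by itself separate copy-on-write sharing from real copies; Private_Dirty in /proc/<pid>/smaps is closer to what actually got duplicated, but even the rough VmRSS numbers make it easy to compare the two orderings.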
EDIT: fixed a small bug in the script caused by an overzealous find/replace, and reran
EDIT2: "-0" slicing? You can do that? :)
EDIT3: better python script, bash looping and visualization, ok done with this rabbit hole for now :)