
I have a Python 3.6 data processing task in which a large dict for looking up dates by ID is pre-loaded and then used in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most, if not all, of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GB, as I expected it would, but it also had another unexpected effect.
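For anyone not familiar with it, sys.intern returns a canonical copy of a string, so equal values end up sharing a single object and each dict entry only holds a pointer to that one copy. A quick toy sketch of the effect (made-up values, not my real data):

import sys

# two equal strings built at runtime are normally two separate objects;
# interning collapses them into one shared object
a = ''.join(['2021', '-05', '-17'])
b = ''.join(['2021', '-05', '-17'])
print(a == b, a is b)                  # True False - equal but distinct objects
print(sys.intern(a) is sys.intern(b))  # True - a single shared object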

Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them gradually copying pages of the dict from global memory across into each sub-process's individually allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
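As I understand it (and I may be wrong here), the reference count lives in the object's own header in CPython, so merely taking a reference to an object writes to the memory page that object sits on. The sketch below reads ob_refcnt straight out of an object's memory with ctypes to illustrate this; it relies on a CPython implementation detail and isn't something to depend on:

import ctypes

# CPython detail: ob_refcnt is the first Py_ssize_t in every object's header,
# so it can be read directly out of the object's own memory
def refcount(obj):
    return ctypes.c_ssize_t.from_address(id(obj)).value

s = ''.join(['some', ' value'])   # an ordinary heap-allocated string
print(refcount(s))                # baseline count
x = s                             # taking another reference...
print(refcount(s))                # ...bumps the count stored inside the object,
                                  # dirtying its page and defeating copy-on-write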

I was only expecting the interning to reduce the memory footprint of the dict, but in fact it stopped the memory usage gradually increasing over the sub-processes' lifetimes as well.

Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load in and populate the dict with, and a sufficient amount of repetition in the values to make sure that interning provides a benefit.
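In case it helps anyone reproduce this: any large pipe-delimited file whose third column repeats a lot will do. Something along these lines generates a suitable throwaway input (the column layout here is made up; only parts[2] matters to the script below):

import random
import sys

# write 30 million lines of "id|name|date" with only a few hundred distinct
# dates, so that interning has plenty of duplicates to collapse
dates = ['2021-%02d-%02d' % (m, d) for m in range(1, 13) for d in range(1, 29)]
with open(sys.argv[1], 'w', encoding='utf-8') as out:
    for i in range(30_000_000):
        out.write('%d|row_%d|%s\n' % (i, i, random.choice(dates)))

The script itself: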

import multiprocessing
import sys

# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()
with open(sys.argv[1], 'r', encoding='utf-8') as file:
  if len(sys.argv) > 2:
    print('interning is on')
  else:
    print('interning is off')
  for i, line in enumerate(file):
    if i > 30000000:
      break
    parts = line.split('|')
    if len(sys.argv) > 2:
      global_map[str(i)] = sys.intern(parts[2])
    else:
      global_map[str(i)] = parts[2]

def read_map():
  # do some nonsense processing with each value in the dict
  global global_map
  for i in range(30000000):
    x = global_map[str(i)]
  y = x + '_'
  return y

print("starting processes")
process_pool = multiprocessing.Pool(processes=10)

for _ in range(10):
  process_pool.apply_async(read_map)

process_pool.close()

process_pool.join()

I ran this script and monitored htop to see the total memory usage.

interning?   mem usage just after 'starting processes' printed   peak mem usage after that
no           7.1 GB                                               28.0 GB
yes          5.5 GB                                               5.6 GB
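(For anyone who wants to reproduce these numbers without watching htop: each worker's peak resident set size can also be printed at the end of read_map using the standard resource module; a small helper along these lines, not part of the script above, would do it. Note that ru_maxrss is reported in kilobytes on Linux.)

import resource

def print_peak_rss(label):
    # peak resident set size of the calling process, in kilobytes on Linux
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s peak RSS: %.1f GB' % (label, peak_kb / (1024 * 1024)))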

While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?

Rob Streeting
  • Of potential interest: [Python Doc](https://docs.python.org/3/library/sys.html#sys.intern), [Related SO answer](https://stackoverflow.com/a/1136852/10640534). – Patol75 May 24 '21 at 06:23

2 Answers


The CPython implementation stores interned strings in a global object that is a regular Python dictionary, where both keys and values are pointers to the string objects.

When a new child process is created, it gets a copy of the parent's address space, so it will use the reduced data dictionary with interned strings.

I've compiled Python with the patch below, and as you can see, both processes have access to the table of interned strings:

test.py:

import multiprocessing as mp
import sys
import _string


PROCS = 2
STRING = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"


def worker():
    proc = mp.current_process()
    interned = _string.interned()

    try:
        idx = interned.index(STRING)
    except ValueError:
        s = None
    else:
        s = interned[idx]

    print(f"{proc}: <{s}>")


def main():
    sys.intern(STRING)

    procs = []

    for _ in range(PROCS):
        p = mp.Process(target=worker)
        p.start()
        procs.append(p)

    for p in procs:
        p.join()


if __name__ == "__main__":
    main()

Test:

# python test.py 
<Process name='Process-1' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>
<Process name='Process-2' parent=3917 started>: <https://www.youtube.com/watch?v=dQw4w9WgXcQ>

Patch:

--- Objects/unicodeobject.c 2021-05-15 15:08:05.117433926 +0100
+++ Objects/unicodeobject.c.tmp 2021-05-15 23:48:35.236152366 +0100
@@ -16230,6 +16230,11 @@
     _PyUnicode_FiniEncodings(&tstate->interp->unicode.fs_codec);
 }
 
+static PyObject *
+interned_impl(PyObject *module)
+{
+    return PyDict_Values(interned);
+}
 
 /* A _string module, to export formatter_parser and formatter_field_name_split
    to the string.Formatter class implemented in Python. */
@@ -16239,6 +16244,8 @@
      METH_O, PyDoc_STR("split the argument as a field name")},
     {"formatter_parser", (PyCFunction) formatter_parser,
      METH_O, PyDoc_STR("parse the argument as a format string")},
+    {"interned", (PyCFunction) interned_impl,
+     METH_NOARGS, PyDoc_STR("lookup interned strings")},
     {NULL, NULL}
 };

You may also want to take a look at the multiprocessing.shared_memory module.
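A minimal sketch of what using it looks like (note that shared_memory was only added in Python 3.8, so it wouldn't apply directly to the 3.6 setup in the question):

from multiprocessing import shared_memory

# create a named block of shared memory and put some bytes into it
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b'hello'

# another process can attach to the same block by name and read it
# without any copying (shown here in a single process for brevity)
other = shared_memory.SharedMemory(name=shm.name)
print(bytes(other.buf[:5]))   # b'hello'

other.close()
shm.close()
shm.unlink()                  # release the block once all processes are done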


HTF
  • "When a new child process is created, it gets a copy of the parent's address space so they will use the reduced data dictionary with interned strings.". I think this is the key part, so just to clarify - is the reduced memory usage down to the smaller size of the dict in memory (since keys and values are pointers) after being copied to the child processes' memory, or is there some reason why the OS never needs to copy the intern dict across to the child processes? Is this dict special and hence doesn't need to be reference counted or modified in any other way to read? – Rob Streeting May 17 '21 at 12:55
  • 1
    When the underlying `fork` call returns, the new process has an exact copy of the virtual memory, however any subsequent changes (either by parent or a child) will invoke the CoW mechanism. The `refcount` still take place but now data dict references much less values so less new memory pages needs to be created for the child processes. When I was testing this for `3.000000e+07` items (300 unique), memory usage decreased by 62% (from 3425.767MB to 1280.022MB) for the values and interned dict was only 607MB. I believe keys are not affected because they're not referenced in the child processes. – HTF May 18 '21 at 22:05
  • 1
    You may also want loop over`dict.values()` instead of indexing, it should be faster. – HTF May 18 '21 at 22:07
  • 1
    I’ve been looking into this a bit more and I can still see a high number of shared memory with a low unique number of values. You can set a bounty so perhaps someone else can provide a better explanation for this. – HTF May 19 '21 at 08:57

Not an answer, but I thought it was of interest to provide an MWE that does not require an input file. Peak memory usage is much higher when manual interning is turned off, which HTF has, in my opinion, properly explained.

from multiprocessing import Pool
from random import choice
from string import ascii_lowercase
# from sys import intern


def rand_str(length):
    return ''.join([choice(ascii_lowercase) for i in range(length)])


def read_map():
    for value in global_map.values():
        x = value
    y = x + '_'
    return y


global_map = dict()
for i in range(20_000_000):
    # global_map[str(i)] = intern(rand_str(4))
    global_map[str(i)] = rand_str(4)
print("starting processes")
if __name__ == '__main__':
    with Pool(processes=2) as process_pool:
        processes = [process_pool.apply_async(read_map)
                     for process in range(process_pool._processes)]
        for process in processes:
            process.wait()
            print(process.get())
Patol75