I have a Python 3.6 data processing task that involves pre-loading a large dict for looking up dates by ID for use in a subsequent step by a pool of sub-processes managed by the multiprocessing module. This process was eating up most if not all of the memory on the box, so one optimisation I applied was to 'intern' the string dates being stored in the dict. This reduced the memory footprint of the dict by several GBs as I expected it would, but it also had another unexpected effect.
Before applying interning, the sub-processes would gradually eat more and more memory as they executed, which I believe was down to them having to copy the dict gradually from global memory across to the sub-processes' individual allocated memory (this is running on Linux and so benefits from the copy-on-write behaviour of fork()). Even though I'm not updating the dict in the sub-processes, it looks like read-only access can still trigger copy-on-write through reference counting.
I was only expecting the interning to reduce the memory footprint of the dict, but in fact it stopped the memory usage gradually increasing over the sub-processes lifetime as well.
Here's a minimal example I was able to build that replicates the behaviour, although it requires a large file to load in and populate the dict with and a sufficient amount of repetition in the values to make sure that interning provides a benefit.
import multiprocessing
import sys
# initialise a large dict that will be visible to all processes
# that contains a lot of repeated values
global_map = dict()
with open(sys.argv[1], 'r', encoding='utf-8') as file:
if len(sys.argv) > 2:
print('interning is on')
else:
print('interning is off')
for i, line in enumerate(file):
if i > 30000000:
break
parts = line.split('|')
if len(sys.argv) > 2:
global_map[str(i)] = sys.intern(parts[2])
else:
global_map[str(i)] = parts[2]
def read_map():
# do some nonsense processing with each value in the dict
global global_map
for i in range(30000000):
x = global_map[str(i)]
y = x + '_'
return y
print("starting processes")
process_pool = multiprocessing.Pool(processes=10)
for _ in range(10):
process_pool.apply_async(read_map)
process_pool.close()
process_pool.join()
I ran this script and monitored htop
to see the total memory usage.
interning? | mem usage just after 'starting processes' printed | peak mem usage after that |
---|---|---|
no | 7.1GB | 28.0GB |
yes | 5.5GB | 5.6GB |
While I am delighted that this optimisation seems to have fixed all my memory issues at once, I'd like to understand better why this works. If the creeping memory usage by the sub-processes is down to copy-on-write, why doesn't this happen if I intern the strings?