
I have a simple script that attempts to stress the concurrent.futures library as follows:

#!/usr/bin/python

import gc
import os

import psutil
from concurrent.futures import ThreadPoolExecutor

WORKERS = 2 ** 10

def report():
    # Number of objects tracked by the GC, and the resident set size in kB.
    print('%d objects' % len(gc.get_objects()))
    print('RSS: %s kB' % (psutil.Process(os.getpid()).memory_info().rss / 2 ** 10))

def run():
    def x(y):
        pass

    # Spin up 1024 worker threads and hand each a no-op task.
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for _ in pool.map(x, range(WORKERS)):
            pass

if __name__ == '__main__':
    report()
    run()
    report()

On a two-core Linux machine running Python 2.7, this produces the following output:

# time ./test.py
7048 objects
RSS: 11968 kB
6749 objects
RSS: 23256 kB

real    0m1.077s
user    0m0.875s
sys     0m0.316s

Although this is a bit of a contrived example, I'm struggling to understand why the RSS increases in this situation and what the allocated memory is being used for.

Linux should handle forked memory fairly well with COW, but since CPython is reference-counted, portions of the inherited memory would not be truly read-only, because the reference counts need to be updated. Considering how small the reference-count overhead is, the 12 MB increase surprises me. If, instead of using ThreadPoolExecutor, I spawn daemon threads with the threading library, the RSS only increases by about 4 MB.
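
For reference, the plain-threading variant I compared against looks roughly like the sketch below (the no-op body and the thread count mirror the script above; this is a reconstruction, not the exact code):

import threading

WORKERS = 2 ** 10

def x(y):
    pass

# Spawn the same number of raw daemon threads instead of a pool.
threads = [threading.Thread(target=x, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.daemon = True
    t.start()
for t in threads:
    t.join()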

At this point it is unclear to me whether to suspect the CPython allocator or the glibc allocator, but my understanding is that the latter should handle this flavor of concurrency and be able to reuse arenas for allocations across the spawned threads.
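
One diagnostic I can think of to separate the two suspects: ask glibc to release its free heap pages and see whether the RSS drops. This is only a sketch and is glibc-specific (malloc_trim is a glibc extension); setting MALLOC_ARENA_MAX=1 in the environment would be another way to probe the arena behavior:

import ctypes
import os

import psutil

def rss_kb():
    # Resident set size in kB.
    return psutil.Process(os.getpid()).memory_info().rss / 2 ** 10

libc = ctypes.CDLL('libc.so.6')
print('RSS before trim: %d kB' % rss_kb())
libc.malloc_trim(0)  # hand free heap pages back to the kernel
print('RSS after trim:  %d kB' % rss_kb())

If the RSS drops after the trim, the memory is sitting free in glibc's arenas; if not, it is more likely held by CPython's own allocator.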

I'm using the backported concurrent.futures 3.0.3 under Python 2.7.9, with glibc 2.4 on a 4.1 kernel. Any advice or hints on how to investigate this further would be greatly appreciated.

Alex M

2 Answers


I suggest you read this answer: https://stackoverflow.com/a/1718522/5632150

As it explains, the number of threads you should spawn depends on whether or not your threads perform any I/O. If they do, there are ways to tune the count upward; if not, I usually use MAX_THREADS = N_CORES + 1.
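
For example, a minimal sketch of that heuristic (multiprocessing.cpu_count() discovers the core count; square and the input range are just placeholders):

import multiprocessing
from concurrent.futures import ThreadPoolExecutor

# Rule of thumb for CPU-bound work: one worker per core, plus one.
MAX_THREADS = multiprocessing.cpu_count() + 1

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    results = list(pool.map(square, range(100)))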

I'm not sure, but are you trying to spawn 1024 threads on a machine with only two cores?

MrGoodKat
  • Thank you for posting the link. In general, I try to follow the practice of matching the number of workers to the number of cores. I'm mostly trying to understand the tradeoffs of exceeding that heuristic, and more so why the memory is allocated this way. – Alex M Aug 31 '17 at 02:05

Most memory allocators don't return all their memory to the OS.

Try calling run() twice and checking the RSS before/after the second time.
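
Something along these lines, reusing run() and the imports from your script (a sketch; the exact reporting format is up to you):

# Call run() repeatedly and watch whether the RSS keeps growing
# or plateaus after the first call.
for i in range(8):
    run()
    print('after run %d, RSS: %d kB'
          % (i, psutil.Process(os.getpid()).memory_info().rss / 2 ** 10))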

(That said, ludicrous numbers of threads are generally not a good idea)

o11c
  • Agreed, this is mostly a thought experiment. Following your advice, I wrapped the last three lines of the script in a `for i in range(8):` loop, and memory usage was relatively consistent: `RSS: 11904 kB, RSS: 23272 kB, RSS: 24128 kB, RSS: 24468 kB, RSS: 24180 kB, RSS: 24476 kB, RSS: 24508 kB, RSS: 24492 kB, RSS: 24200 kB`. My final question would be: is there any caching at play here, or is it simply reusing the previous allocations? – Alex M Aug 31 '17 at 02:12
  • I doubt there is much caching outside the memory allocator, but I don't know for sure. Thread-local storage (TLS) in particular is very weird. – o11c Aug 31 '17 at 15:05