I run a server which acts as a data processing node for clients within the team. Recently we've been refactoring legacy code within the server to leverage numpy for some of the filtering/transform jobs.
As we have to serve this data out to remote clients, we convert the numpy data into various forms, using ndarray.tolist() as an intermediate step.
Each query is stateless and there are no globals, so no references are maintained between queries.
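Roughly, the conversion path looks like the sketch below. This is an illustrative stand-in rather than our actual code: serialize_result, the dict shape, and the JSON target are all assumptions made just to show where tolist() sits in the pipeline.

    import json

    import numpy as np

    def serialize_result(result):
        # Convert any ndarray values into plain Python lists so the payload
        # can be serialized for the remote client.
        payload = {
            key: (value.tolist() if isinstance(value, np.ndarray) else value)
            for key, value in result.items()
        }
        return json.dumps(payload)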
In one particular step I get an apparent memory leak, which I have been trying to track down with memory_profiler. This step involves converting a large (>4M entries) ndarray of floats to a Python list. The first time I issue the query, the tolist() call allocates 120 MiB of memory and then deallocates 31 MiB when I release the numpy array. The second (and every subsequent) time I issue the identical query, both the allocation and the deallocation are 31 MiB. Each different query I issue shows the same pattern, though with different absolute values.
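A stripped-down version of that step, profiled in isolation, would look something like the following. The names and the 4M element count are only there to mirror our setup; this is not the actual server code.

    import numpy as np
    from memory_profiler import profile

    @profile
    def convert(arr):
        # ndarray.tolist() creates one Python float object per entry, so a
        # >4M element array produces a correspondingly large allocation here.
        newArr = arr.tolist()
        del arr
        return newArr

    if __name__ == "__main__":
        data = np.random.rand(4_000_000)   # roughly mirrors our >4M float entries
        first = convert(data)
        second = convert(np.random.rand(4_000_000))  # identical-sized second query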
I've torn apart my code and forced in some del statements for illustrative purposes. The output below is from memory_profiler's profile decorator.
First issue of query:
Line #    Mem usage    Increment   Line Contents
================================================
   865    296.6 MiB      0.0 MiB   p = ikeyData[1]['value']
   866    417.2 MiB    120.6 MiB   newArr = p.tolist()
   867    417.2 MiB      0.0 MiB   del p
   868    385.6 MiB    -31.6 MiB   del ikeyData[1]['value']
   869    385.6 MiB      0.0 MiB   ikeyData[1]['value'] = newArr
Second (and subsequent) instances of the same query:
Line #    Mem usage    Increment   Line Contents
================================================
   865    494.7 MiB      0.0 MiB   p = ikeyData[1]['value']
   866    526.3 MiB     31.6 MiB   newArr = p.tolist()
   867    526.3 MiB      0.0 MiB   del p
   868    494.7 MiB    -31.6 MiB   del ikeyData[1]['value']
   869    494.7 MiB      0.0 MiB   ikeyData[1]['value'] = newArr
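For what it's worth, I've also been watching the build-up from outside the decorated function with something like the sketch below. Again, run_query and the sizes are illustrative stand-ins for our real handler and workload, not the server code itself.

    import numpy as np
    from memory_profiler import memory_usage

    def run_query(n):
        # Stand-in for one stateless query: produce an ndarray, convert it,
        # then drop every reference before returning.
        arr = np.random.rand(n)
        newArr = arr.tolist()
        del arr, newArr

    if __name__ == "__main__":
        # Different sizes per query, mimicking our highly variable workload.
        for n in (4_000_000, 5_000_000, 6_000_000):
            samples = memory_usage((run_query, (n,)), interval=0.1)
            print(f"{n} entries: peak {max(samples):.1f} MiB, end {samples[-1]:.1f} MiB")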
As you can imagine, in a long-running process with highly variable queries, these allocations build up and force us to bounce the server regularly.
Does anyone have thoughts as to what might be happening here?