I have just seen my Python process get killed again on my VPS with 1 GB of RAM, so I'm now looking into optimizing my program's memory usage.
I've got a function that downloads a web page, searches it for data, and returns a pandas DataFrame with what it found. This function is called thousands of times from within a for loop, and that loop ends up maxing out the memory on my server.
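Roughly, the code has this shape (a simplified sketch: `scrape`, `urls`, and the xpath are illustrative stand-ins, and I'm using requests aliased as `http`):

```python
import gc

import pandas as pd
import requests as http  # requests, aliased as http
from lxml import html

urls = ["https://example.com/page1"]  # stand-in for my real list of thousands of URLs


def scrape(url):
    page = http.get(url)
    if page.status_code == 200:
        tree = html.fromstring(page.text)
        del page
        # ... search for data with xpaths and build a data dict ...
        data = {"value": tree.xpath("//td/text()")}  # illustrative xpath
        df = pd.DataFrame(data)
        del tree
        gc.collect()
        return df


for url in urls:
    df = scrape(url)
    # ... use df ...
```

Profiling the body of the function with memory_profiler gives: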
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    93     75.6 MiB      1.2 MiB           1       page = http.get(url)
    94     75.6 MiB      0.0 MiB           1       if page.status_code == 200:
    95     78.4 MiB      2.8 MiB           1           tree = html.fromstring(page.text)
    96     78.4 MiB      0.0 MiB           1           del page
                                                   ... code to search for data using xpaths and assign to data dict ...
   117     78.4 MiB      0.1 MiB           1           df = pd.DataFrame(data)
   118     78.4 MiB      0.0 MiB           1           del tree
   119     78.4 MiB      0.0 MiB           1           gc.collect()
   120     78.4 MiB      0.0 MiB           1           return df
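For reference, I generate this output by decorating the function with memory_profiler's `@profile` and running the script under the module (`myscript.py` is a made-up name):

```python
from memory_profiler import profile


@profile
def scrape(url):
    ...  # function body as above
```

followed by `python -m memory_profiler myscript.py` on the command line.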
The memory_profiler results above show that, as expected, the lines of my code with the largest memory increments are the http.get() and html.fromstring() calls and their assignments. The actual DataFrame creation is much smaller by comparison.
Now, I would expect the only lasting memory increase to my program to be the size of the DataFrame returned by the function, not ALSO the size of the page and tree objects. Instead, with every call to this function, my program's memory grows by the combined size of all three objects, and it never decreases.
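To clarify what I mean by the overall increase, here is roughly how I watch the process memory between calls (a sketch using psutil; `scrape` and `urls` are the illustrative names from above):

```python
import psutil

proc = psutil.Process()  # handle on the current process

for url in urls:
    df = scrape(url)
    # resident set size after each call; this keeps climbing and never drops
    print(f"rss = {proc.memory_info().rss / 1024**2:.1f} MiB")
```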
I have tried adding del statements before the end of the function to de-reference the objects I no longer need, but this does not seem to make a difference.
I do see that for a scalable application I would need to start saving results to disk, but at this point, even if I did save to disk, I'm not sure how to free up the memory already used.
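For example, I was planning to append each result to a file and drop the in-memory copy straight away, something like this (`results.csv` is a made-up path):

```python
import os

OUT_PATH = "results.csv"  # hypothetical output file

for url in urls:
    df = scrape(url)
    if df is None:  # non-200 responses return nothing
        continue
    # write the header only on the first append, then discard the frame
    df.to_csv(OUT_PATH, mode="a", header=not os.path.exists(OUT_PATH), index=False)
    del df
```

but from what I'm seeing above, the memory would apparently stay in use even after the del.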
Thanks for your help.