
I have just seen my Python process get killed again on my VPS with 1GB of RAM and am now looking into optimizing memory usage in my program.

I've got a function that downloads a web page, looks for data, and then returns a pandas DataFrame with what it found. This function is called thousands of times from within a for loop, which ends up maxing out the memory on my server.

Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
    93     75.6 MiB      1.2 MiB           1           page = http.get(url)
    94     75.6 MiB      0.0 MiB           1           if page.status_code == 200:
    95     78.4 MiB      2.8 MiB           1               tree = html.fromstring(page.text)
    96     78.4 MiB      0.0 MiB           1               del page

... code to search for data using xpaths and assign to data dict

   117     78.4 MiB      0.1 MiB           1           df = pd.DataFrame(data)
   118     78.4 MiB      0.0 MiB           1           del tree
   119     78.4 MiB      0.0 MiB           1           gc.collect()
   120     78.4 MiB      0.0 MiB           1           return df

The memory_profiler results above show that the lines of my code with the largest memory increments are the ones I expected: the http.get() and html.fromstring() calls and assignments. The actual DataFrame creation is much smaller in comparison.
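(For reference, the table above was generated with memory_profiler's @profile decorator. A minimal, self-contained setup looks roughly like this; the list allocation is just a stand-in for my real scraping code:)

    from memory_profiler import profile

    @profile  # prints the per-line Mem usage / Increment table when the function runs
    def allocate():
        data = [0] * 10_000_000  # large enough to show up as an Increment
        total = sum(data)
        del data
        return total

    if __name__ == "__main__":
        allocate()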

Now, I would expect the only overall memory increase in my program to be the size of the DataFrame returned by the function, and not ALSO the size of the page and tree objects. Instead, with every call to this function, my program's memory grows by the combined size of all three objects, and it never decreases.

I have tried adding del statements before the end of the function to de-reference the objects I no longer need, but this does not seem to make a difference.

I do see that for a scalable application I would need to start saving to disk, but at this point even if I do save to disk I'm not sure how to free up the memory already used.
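(What I have in mind is something like the sketch below; fetch_page and urls are placeholders for my real function and URL list:)

    import pandas as pd

    def save_results(urls, out_path="results.csv"):
        """Append each page's DataFrame to disk instead of accumulating them in memory."""
        first = True
        for url in urls:
            df = fetch_page(url)  # the function profiled above (placeholder name)
            if df is not None and not df.empty:
                df.to_csv(out_path, mode="a", header=first, index=False)
                first = False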

Thanks for your help


1 Answer


After a lot of digging, I finally found the answer to my own question. The issue was the string results of my xpath expressions: by default lxml returns "smart strings", which keep a reference back to their parent tree and are known to eat up memory. Disabling them gives me the kind of memory consumption I was expecting.
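For anyone landing here, a minimal sketch of the fix: compile the XPath with smart_strings=False so results come back as plain str objects that don't hold a reference to the whole parsed tree:

    from lxml import etree, html

    tree = html.fromstring("<html><body><p>some text</p></body></html>")

    # Plain strings are returned instead of "smart" strings, so nothing here
    # keeps the parsed tree alive once `tree` itself is dropped.
    find_text = etree.XPath("//p/text()", smart_strings=False)
    print(find_text(tree))  # ['some text']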

More information: "lxml parser eats all memory" and https://lxml.de/xpathxslt.html#xpath-return-values
