I basically have the same problem as the guy here: Python high memory usage with BeautifulSoup
My BeautifulSoup objects are not garbage collected, which results in significant RAM consumption. Here is the code I use ("entry" is an object I get from an RSS feed; it is basically an RSS article):
title = entry.title
date = arrow.get(entry.updated).format('YYYY-MM-DD')

try:
    url = entry.feedburner_origlink
except AttributeError:
    url = entry.link

abstract = None
graphical_abstract = None
author = None

soup = BeautifulSoup(entry.summary)

r = soup("img", align="center")
print(r)
if r:
    graphical_abstract = r[0]['src']

# 'response' comes from an earlier requests.get() call (not shown here)
if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text)

    # Get the title (w/ html)
    title = soup("h2", attrs={"class": "alpH1"})
    if title:
        title = title[0].renderContents().decode().strip()

    # Get the abstract (w/ html)
    r = soup("p", xmlns="http://www.rsc.org/schema/rscart38")
    if r:
        abstract = r[0].renderContents().decode()
        if abstract == "":
            abstract = None

    r = soup("meta", attrs={"name": "citation_author"})
    if r:
        author = [tag['content'] for tag in r]
        author = ", ".join(author)
So in the docs (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Improving%20Memory%20Usage%20with%20extract) they say the problem can come from the fact that, as long as you use a tag contained in the soup object, the soup object stays in memory. So I tried something like this (for every place I use a soup object in the previous example):
r = soup("img", align="center")[0].extract()
graphical_abstract = r['src']
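To be explicit about how I read that section of the docs: the idea seems to be that you extract() the small part you need, keep only plain data from it, and make sure no reference to the soup or to any of its tags survives. Something like this sketch (the helper function is made up just to illustrate the pattern, it is not my actual code):

from bs4 import BeautifulSoup  # bs4 assumed; on BeautifulSoup 3 the import differs

def get_graphical_abstract(summary_html):
    # Hypothetical helper: once the tag has been extract()ed and the function
    # returns, no reference to `soup` or its tags survives, so the whole
    # tree should become collectable.
    soup = BeautifulSoup(summary_html)
    tags = soup("img", align="center")
    if not tags:
        return None
    img = tags[0].extract()  # detach the tag from the tree
    return img['src']        # keep only the plain string, not the tag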
But still, the memory is not freed when the program exits the scope.
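For what it's worth, even a heavy-handed, explicit cleanup would be fine with me if it actually released the memory. I have in mind something along these lines (just a sketch, assuming bs4, whose decompose() destroys the parse tree; I don't know whether this is the right tool here, which is part of my question):

import gc
from bs4 import BeautifulSoup  # bs4 assumed for decompose()

graphical_abstract = None
soup = BeautifulSoup(entry.summary)  # same `entry` as above
r = soup("img", align="center")
if r:
    graphical_abstract = r[0]['src']

soup.decompose()  # recursively destroy the tree
del soup, r       # drop the remaining references
gc.collect()      # force a collection pass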
So, I'm looking for an efficient way to delete a soup object from memory. Do you have any idea?