I basically have the same problem as the guy here: Python high memory usage with BeautifulSoup
My BeautifulSoup objects are not garbage collected, which results in significant RAM consumption. Here is the code I use ("entry" is an object I get from an RSS feed; it is basically an RSS article):
title = entry.title
date = arrow.get(entry.updated).format('YYYY-MM-DD')

try:
    url = entry.feedburner_origlink
except AttributeError:
    url = entry.link

abstract = None
graphical_abstract = None
author = None

soup = BeautifulSoup(entry.summary)

r = soup("img", align="center")
print(r)
if r:
    graphical_abstract = r[0]['src']

# 'response' comes from an earlier requests.get() call (not shown here)
if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text)

    # Get the title (w/ html)
    title = soup("h2", attrs={"class": "alpH1"})
    if title:
        title = title[0].renderContents().decode().strip()

    # Get the abstract (w/ html)
    r = soup("p", xmlns="http://www.rsc.org/schema/rscart38")
    if r:
        abstract = r[0].renderContents().decode()
        if abstract == "":
            abstract = None

    r = soup("meta", attrs={"name": "citation_author"})
    if r:
        author = [tag['content'] for tag in r]
        author = ", ".join(author)
So in the docs (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Improving%20Memory%20Usage%20with%20extract) they say the problem can come from the fact that, as long as you use a tag contained in the soup object, the soup object stays in memory. So I tried something like this (for every place I use a soup object in the previous example):
r = soup("img", align="center")[0].extract()
graphical_abstract = r['src']
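To be explicit about how I read that section of the docs: the idea seems to be that you extract() the small part you need, keep only plain data from it, and make sure no reference to the soup or to any of its tags survives. Something like this sketch (the helper function is made up just to illustrate the pattern, it is not my actual code):

from bs4 import BeautifulSoup  # bs4 assumed; on BeautifulSoup 3 the import differs

def get_graphical_abstract(summary_html):
    # Hypothetical helper: once the tag has been extract()ed and the function
    # returns, no reference to `soup` or its tags survives, so the whole
    # tree should become collectable.
    soup = BeautifulSoup(summary_html)
    tags = soup("img", align="center")
    if not tags:
        return None
    img = tags[0].extract()  # detach the tag from the tree
    return img['src']        # keep only the plain string, not the tag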
But still, the memory is not freed when the program exits the scope.
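For what it's worth, even a heavy-handed, explicit cleanup would be fine with me if it actually released the memory. I have in mind something along these lines (just a sketch, assuming bs4, whose decompose() destroys the parse tree; I don't know whether this is the right tool here, which is part of my question):

import gc
from bs4 import BeautifulSoup  # bs4 assumed for decompose()

graphical_abstract = None
soup = BeautifulSoup(entry.summary)  # same `entry` as above
r = soup("img", align="center")
if r:
    graphical_abstract = r[0]['src']

soup.decompose()  # recursively destroy the tree
del soup, r       # drop the remaining references
gc.collect()      # force a collection pass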
So, I'm looking for an efficient way to delete a soup object from memory. Do you have any idea?