Memory Leak while parsing html page source with BeautifulSoup & Requests

Question

The basic idea is to make GET requests to a list of URLs and extract the text from each page source by stripping HTML tags and scripts with BeautifulSoup. Python version: 2.7.

The problem is that the parser function consumes more memory with every request, so the process size grows steadily.

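from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    # file variant: page_source is the path to a saved HTML file
    soup = BeautifulSoup(open(page_source),'html.parser')
    # in-memory variant: soup = BeautifulSoup(page_source, "lxml")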
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text

Memory grows even when parsing a local text file. For example:

import requests

# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # memory: 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # memory: 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # memory: 300 MB
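
To check whether the growth comes from Python-level objects, one option is to record traced memory after each call. The following is only a minimal sketch: it assumes Python 3.4+ (the `tracemalloc` module does not exist in 2.7), and `measure_growth` and `no_leak` are hypothetical names used for illustration, with `no_leak` standing in for `get_text_from_page_source`.

```python
import gc
import tracemalloc


def measure_growth(func, arg, repeats=3):
    """Call func(arg) repeatedly; return traced Python memory in bytes after each call."""
    readings = []
    tracemalloc.start()
    try:
        for _ in range(repeats):
            func(arg)       # the result is discarded on purpose
            gc.collect()    # clear collectable reference cycles before measuring
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


def no_leak(text):
    # Hypothetical stand-in for get_text_from_page_source
    return text.upper()


readings = measure_growth(no_leak, "<p>hello</p>" * 1000, repeats=3)
print(readings)
```

If the readings level off after the first call, the growth is probably not caused by Python objects that stay referenced. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory taken directly by a C library such as the one behind lxml would not show up here.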

You can store the response in a temporary file, then read the file one line at a time and process it. – vishal, Aug 17 '18 at 12:09
I'm curious how you run this code. Is it through some IDE? If so, which one? – roganjosh, Aug 17 '18 at 12:13
@serbia99 Yes, I tried both ways: first parsing directly in memory, then saving the page source to a text file and parsing that file. The same issue occurs. – wizard, Aug 17 '18 at 12:15
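
The line-at-a-time idea from the first comment can be sketched without building a full tree. This is a swapped-in technique, not BeautifulSoup: it uses `html.parser.HTMLParser` from the Python 3 standard library, and its output will not match `soup.get_text()` exactly. `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:  # drop whitespace-only chunks
                self.chunks.append(stripped)


def extract_text(lines):
    """Feed an iterable of HTML lines to the parser and return the joined text."""
    parser = TextExtractor()
    for line in lines:
        parser.feed(line)
    parser.close()
    return "\n".join(parser.chunks)


sample = [
    "<html><head><style>p { color: red; }</style></head>\n",
    "<body><p>Hello</p><script>var x = 1;</script>\n",
    "<p>World</p></body></html>\n",
]
result = extract_text(sample)
print(result)
```

Because `extract_text` accepts any iterable of strings, an open file object could be passed in place of `sample`, so the whole document never has to be held as a parse tree.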

score 2 · Answer 1 · answered Aug 17 '18 at 12:30

You can try closing the response, dropping the reference to it, and calling the garbage collector explicitly:

import gc
response.close()
response = None
gc.collect()

Also this might help you: Python high memory usage with BeautifulSoup
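
As a rough illustration of why an explicit `gc.collect()` can matter: a parse tree whose nodes point at both their parent and their children contains reference cycles, and reference counting alone does not free those. The sketch below assumes Python 3 and CPython, and `Node` is a toy stand-in, not BeautifulSoup's real class.

```python
import gc
import weakref


class Node(object):
    """Toy parse-tree node with parent and child links (hypothetical)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def build_and_drop_tree():
    """Build a two-node tree and return only a weak reference to its root."""
    root = Node()
    Node(parent=root)          # child <-> root form a reference cycle
    return weakref.ref(root)   # a weak reference does not keep root alive


gc.disable()  # keep the automatic collector from running during the demo
try:
    probe = build_and_drop_tree()
    alive_before = probe() is not None  # the cycle keeps the tree alive
    gc.collect()                        # the cycle collector frees it
    alive_after = probe() is not None
finally:
    gc.enable()

print(alive_before, alive_after)
```

Even after such objects are freed, the memory reported for the process by the operating system may not shrink right away, so a flat or falling Python-level count and a high process size can coexist.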

Adelina

score 0 · Answer 2 · answered Aug 17 '18 at 12:29

You could try calling soup.decompose() just before get_text_from_page_source returns, to destroy the tree.

Also, if you are opening a text file instead of passing the requests content directly, as this line does:

soup = BeautifulSoup(open(page_source),'html.parser')

Remember to close it when you are done. To keep it short, you could change that line to:

with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(),'html.parser')

Pablo M

Tried that and added '.close()', but nothing changed. – wizard Aug 17 '18 at 13:22
Did you try to use soup.decompose() when you are done parsing? – Pablo M Aug 17 '18 at 13:26
Yes, I added soup.decompose(), but nothing changed. – wizard Aug 17 '18 at 13:52
Could you execute those 3 requests in reverse order (3, 2, 1) and share with us that nice memory plot you made? – Pablo M Aug 17 '18 at 14:03
That would not make any sense, Pablo. – wizard Aug 17 '18 at 16:21
I guess it doesn't then. – Pablo M Aug 17 '18 at 17:41
