I'm pretty new to web scraping and to using the BeautifulSoup library in Python, so I've run into this problem: I have to download and scrape content from a large number of web pages. Downloading them is not a problem, but when I create a BeautifulSoup object for every page (in order to parse it), my program gets incredibly slow. Is there a way to reduce this overhead, and maybe avoid creating a whole new BeautifulSoup object for every page I want to analyze? Here's the code I execute:

    import requests
    from bs4 import BeautifulSoup

    pages = []
    soups = []

    for action in actions[:100]:
        # Here I download the pages I need
        curr_url = base_url + action
        r = requests.get(curr_url, cookies=auth_cookie)
        pages.append(r.content)

    for p in pages:
        # Here I create BeautifulSoup objects that I'll use later to do my computations
        soup = BeautifulSoup(p, features="html5lib")
        soups.append(soup)
autistik1
    Did you profile this code? are you positive the bottleneck is `BeautifulSoup` and not the 100 requests you are doing? – DeepSpace Oct 07 '20 at 18:05
  • Yes, the requests are a piece of cake actually and are served in a glimpse (that's why I separated the requests code from the "souping" code) – autistik1 Oct 07 '20 at 18:09
  • Does this answer your question? [Speeding up beautifulsoup](https://stackoverflow.com/questions/25539330/speeding-up-beautifulsoup) – Chris Greening Oct 07 '20 at 18:42

1 Answer

A solution I found is to create the BeautifulSoup object using 'lxml' as the parser argument, since it gives noticeably better performance than the html5lib parser.
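For example, a minimal sketch of the parsing loop from the question with the lxml parser (this assumes the lxml package is installed, e.g. with pip install lxml, and reuses the pages list built in the download loop):

    from bs4 import BeautifulSoup

    soups = []
    for p in pages:
        # "lxml" is a C-based parser and is typically much faster than "html5lib"
        soup = BeautifulSoup(p, features="lxml")
        soups.append(soup)

html5lib is a pure-Python parser that aims for browser-like leniency, which is why it is slower; lxml trades some of that leniency for speed.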

autistik1