I'm pretty new to web scraping and to using the BeautifulSoup library in Python, so I've run into this problem: I have to download and scrape content from a large number of web pages. Downloading them is not a problem, but when I create a BeautifulSoup object for every page (in order to parse it), my program gets incredibly slow. Is there a way to reduce this overhead, and maybe avoid creating a whole new BeautifulSoup object for every page I want to analyze? Here's the code I execute:

    import requests
    from bs4 import BeautifulSoup

    pages = []
    soups = []

    for action in actions[:100]:
        # Here I download the pages I need
        curr_url = base_url + action
        r = requests.get(curr_url, cookies=auth_cookie)
        pages.append(r.content)

    for p in pages:
        # Here I create BeautifulSoup objects that I'll use later to do my computations
        soup = BeautifulSoup(p, features="html5lib")
        soups.append(soup)
autistik1
    Did you profile this code? are you positive the bottleneck is `BeautifulSoup` and not the 100 requests you are doing? – DeepSpace Oct 07 '20 at 18:05
  • Yes, the requests are a piece of cake actually and are served in a glimpse (that's why I separated the requests code from the "souping" code) – autistik1 Oct 07 '20 at 18:09
  • Does this answer your question? [Speeding up beautifulsoup](https://stackoverflow.com/questions/25539330/speeding-up-beautifulsoup) – Chris Greening Oct 07 '20 at 18:42

1 Answer

A solution I found is to create the BeautifulSoup object using 'lxml' as the parser argument, since it gives noticeably better performance than the html5lib parser.
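For example, a minimal sketch of the parsing loop from the question with the lxml parser (this assumes the lxml package is installed, e.g. with pip install lxml, and reuses the pages list built in the download loop):

    from bs4 import BeautifulSoup

    soups = []
    for p in pages:
        # "lxml" is a C-based parser and is typically much faster than "html5lib"
        soup = BeautifulSoup(p, features="lxml")
        soups.append(soup)

html5lib is a pure-Python parser that aims for browser-like leniency, which is why it is slower; lxml trades some of that leniency for speed.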

autistik1