I'm fairly new to web scraping and to the BeautifulSoup library in Python, and I've run into this problem: I need to download and scrape content from a large number of web pages. Downloading them is not a problem, but when I create a BeautifulSoup object for every page (in order to parse it), my program gets incredibly slow. Is there a way to reduce this overhead, and perhaps avoid creating a whole new BeautifulSoup object for every page I want to analyze? Here's the code I run:
import requests
from bs4 import BeautifulSoup

pages = []
soups = []

for action in actions[:100]:
    # Download the pages I need
    curr_url = base_url + action
    r = requests.get(curr_url, cookies=auth_cookie)
    pages.append(r.content)

for p in pages:
    # Create the BeautifulSoup objects that I'll use later for my computations
    soup = BeautifulSoup(p, features="html5lib")
    soups.append(soup)
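For reference, here is a minimal, self-contained sketch (the URL is only a placeholder, not one of my real pages) that times just the BeautifulSoup construction for a single downloaded page repeated 100 times; this is the step where the slowdown shows up, while the download itself is fast:

import time

import requests
from bs4 import BeautifulSoup

# Placeholder page; my real pages come from base_url + action as above
html = requests.get("https://example.com").content

start = time.perf_counter()
for _ in range(100):
    # This constructor call is what appears to dominate the runtime
    BeautifulSoup(html, features="html5lib")
print("parse time for 100 pages:", time.perf_counter() - start, "seconds")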