I have an application that parses a lot of web pages. For the parsing I use Beautiful Soup, and it works fine; I am not looking for a parser replacement. From my own timing and benchmarking I can see that most of the time is spent fetching the actual HTML with the web request, not parsing it with Beautiful Soup. This is my code:
import ssl
import urllib.request

from bs4 import BeautifulSoup as soup


def get_html(url: str):
    # Pretend to be a browser so the server doesn't reject the request
    req = urllib.request.Request(
        url,
        data=None,
        headers={'User-Agent': 'Chrome/35.0.1916.47'})
    # Bare TLSv1 context (note: this skips certificate verification)
    uClient = urllib.request.urlopen(req, context=ssl.SSLContext(ssl.PROTOCOL_TLSv1))
    html = uClient.read()
    uClient.close()
    return html
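For context, each fetched page is then handed to Beautiful Soup, more or less like this (the URL is just a placeholder):

page_html = get_html("https://example.com")
page_soup = soup(page_html, "html.parser")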
Now, just for testing, I timed this (with some random URL):
for i in range(20):
    myhtml = get_html(url)
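The timing itself was just a wall-clock measurement around that loop, something like this (the time.perf_counter wrapper here is a sketch, the exact harness doesn't matter):

import time

start = time.perf_counter()
for i in range(20):
    myhtml = get_html(url)
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.2f} s, per fetch: {elapsed / 20:.2f} s")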
This took an average of 11.30 seconds, which is far too slow: in my application I may need to fetch hundreds of pages, so I clearly need a faster solution. By the way, if I add the Beautiful Soup parser to the loop like this:
for i in range(20):
    myhtml = get_html(url)
    page_soup = soup(myhtml, "html.parser")
the average only rises to 12.20 seconds, so I can say with confidence that the bottleneck is fetching the HTML, not the parser.
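One direction I've been considering is fetching the pages in parallel, since the time seems to be spent waiting on the network rather than on the CPU. A rough, untested sketch (urls is a placeholder for the list of pages I need; max_workers is chosen arbitrarily):

from concurrent.futures import ThreadPoolExecutor

def get_many(urls):
    # Each worker blocks on the network, so threads can overlap the waiting
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(get_html, urls))

Is something like this the right approach, or would I be better off reusing connections (e.g. requests with a persistent Session)?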