I have an application that parses a lot of web pages. For the parsing I use Beautiful Soup, and it works fine; I am not looking for a parser replacement. From my own timing and benchmarking I can see that most of the time is spent fetching the actual HTML with the web request, not parsing it with Beautiful Soup. This is my code:
import ssl
import urllib.request

from bs4 import BeautifulSoup as soup


def get_html(url: str):
    # Pretend to be a browser so the server doesn't reject the request
    req = urllib.request.Request(
        url,
        data=None,
        headers={'User-Agent': 'Chrome/35.0.1916.47'})
    # Bare TLSv1 context (note: this skips certificate verification)
    uClient = urllib.request.urlopen(req, context=ssl.SSLContext(ssl.PROTOCOL_TLSv1))
    html = uClient.read()
    uClient.close()
    return html
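For context, each fetched page is then handed to Beautiful Soup, more or less like this (the URL is just a placeholder):

page_html = get_html("https://example.com")
page_soup = soup(page_html, "html.parser")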
Now, just for testing, I timed this (with some random URL):
for i in range(20):
    myhtml = get_html(url)
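The timing itself was just a wall-clock measurement around that loop, something like this (the time.perf_counter wrapper here is a sketch, the exact harness doesn't matter):

import time

start = time.perf_counter()
for i in range(20):
    myhtml = get_html(url)
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.2f} s, per fetch: {elapsed / 20:.2f} s")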
This took an average of 11.30 seconds, which is far too slow: in my application I may need to fetch hundreds of pages, so I clearly need a faster solution. By the way, if I add the Beautiful Soup parser to the loop like this:
for i in range(20):
    myhtml = get_html(url)
    page_soup = soup(myhtml, "html.parser")
the average only rises to 12.20 seconds, so I can say with confidence that the bottleneck is fetching the HTML, not the parser.
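One direction I've been considering is fetching the pages in parallel, since the time seems to be spent waiting on the network rather than on the CPU. A rough, untested sketch (urls is a placeholder for the list of pages I need; max_workers is chosen arbitrarily):

from concurrent.futures import ThreadPoolExecutor

def get_many(urls):
    # Each worker blocks on the network, so threads can overlap the waiting
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(get_html, urls))

Is something like this the right approach, or would I be better off reusing connections (e.g. requests with a persistent Session)?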