
I have written a web scraping program in Python. It works correctly, but takes 1.5 hours to execute, and I am not sure how to optimize the code. The logic is: every country has many ASNs with the client name. I collect all the ASN links (e.g. https://ipinfo.io/AS2856) and use Beautiful Soup and regex to extract the data as JSON.

The output is just a simple JSON object, along the lines of the example below.
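Each entry is keyed by the ASN number and holds the scraped fields (the values here are purely illustrative):

# Illustrative only – the real values come from the scraped pages
{'2856': {'Country': 'United Kingdom',
          'Name': 'British Telecommunications PLC',
          'Registry': 'ripencc',
          'num_ip_addresses': '<count from the page>'}}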

import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'


def url_to_soup(url):
    #bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}

    print("Scraping Pages for ASN Data...")

    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&amp;)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # flag re.S makes '.' match any character at all, including a newline
                ip = None  # default so the mapping below never hits an undefined name
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                #print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}

    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)

The output is as expected, but super slow.

2 Answers


I think what you need is to run the scraping in multiple processes. This can be done using the Python multiprocessing package, since multithreaded programs are held back in Python by the GIL (Global Interpreter Lock). There are plenty of examples of how to do this; here are some, and a minimal sketch follows the links:

  1. Multiprocessing Spider
  2. Speed up Beautiful soup scraper
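Here is one way that idea could look for your code, assuming the per-ASN parsing is pulled into a helper; scrape_asn is a hypothetical name and the regexes are copied straight from the question, so treat this as a sketch rather than a drop-in replacement:

import re
import urllib.request
from multiprocessing import Pool

SITE = 'https://ipinfo.io'

def scrape_asn(asn):
    # Fetch one ASN page and pull out the same fields the question's inner loop collects.
    html = urllib.request.urlopen(SITE + '/AS' + asn).read().decode('utf-8', 'replace')
    asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&amp;)]+)', html).groups()
    country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', html).group("COUNTRY")
    registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', html, re.S).group("REGISTRY").strip()
    ip_match = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', html, re.S)
    ip = ip_match.group("IP").strip() if ip_match else None
    return asn_code[2:], {'Country': country, 'Name': name,
                          'Registry': registry, 'num_ip_addresses': ip}

if __name__ == '__main__':
    asns = ['2856', '15169']           # collect these from the country pages as before
    with Pool(processes=8) as pool:    # tune the worker count to what the site tolerates
        mappings = dict(pool.map(scrape_asn, asns))
    print(mappings)

The pool hides the per-request latency by keeping several downloads in flight at once, which is where almost all of the 1.5 hours is being spent.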
Milton Arango G

You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that

  • You blend into the background noise
  • You don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner

What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN; a minimal sketch follows.
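Something along these lines, assuming pyasn is installed and you have already built an IPASN data file with its download/convert utilities (the file name below is a placeholder), and using its documented lookup/get_as_prefixes helpers:

import pyasn

asndb = pyasn.pyasn('ipasn_db.dat')      # placeholder path to a converted RIB/IPASN file
asn, prefix = asndb.lookup('8.8.8.8')    # ASN and BGP prefix announcing this IP
prefixes = asndb.get_as_prefixes(2856)   # prefixes originated by AS2856
print(asn, prefix, len(prefixes))

The lookups run against a local database, so they are fast and put no load on anyone's website.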

WayToDoor