For starters, note that this could be a big operation. For example, if each page has an average of only 10 unique links, you're looking at over 10 million requests if you want to recurse 7 layers deep.
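To put a rough number on that, here's the back-of-the-envelope arithmetic (a crude upper bound that ignores duplicate links):

links_per_page = 10
depth = 7
# Level 0 is the starting page; each level multiplies the page count by 10.
total_requests = sum(links_per_page ** level for level in range(depth + 1))
print(total_requests)  # 11111111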
Also, I'd use an HTML parsing library like BeautifulSoup instead of regex, which is a brittle way to scrape HTML. I'd also avoid printing to stdout, which slows things down.
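For reference, link extraction with BeautifulSoup is only a couple of lines. Here's a minimal sketch (the URL is a placeholder, and it assumes the lxml parser is installed):

import requests
from bs4 import BeautifulSoup

# Parse anchor tags instead of regexing the raw HTML; the URL is a placeholder.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "lxml")
links = [a["href"] for a in soup.find_all("a", href=True)]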
As for threading, one approach is to use a work queue. Python's queue.Queue class is thread safe, so you can create a pool of worker threads that poll it for URLs to retrieve. When a thread gets a URL, it finds all the links on the page and appends the relevant URLs (or page data, if you wish) to a global list, which is a thread-safe operation on CPython (for other implementations, use a lock on shared data structures). The new URLs are then enqueued on the work queue and the process continues.
Threads exit when they pull an item whose level has reached 0. Since we're using a queue (BFS) rather than a stack (DFS), shallower levels are fully processed before deeper ones. The (probably safe) assumption here is that there are more levels of links than the crawl depth, and enough links at the deepest level for every worker to receive one; otherwise, some threads would block forever on an empty queue.
The parallelism comes from threads blocking while they wait on request responses, which frees the CPU to run a different thread whose response has already arrived so it can parse the HTML and enqueue new work.
If you'd like to run on multiple cores to help parallelize the CPU-bound portion of the workload, read about the GIL and look into spawning processes (for example, with the multiprocessing module). But threading alone gets you much of the way, since the bottleneck is I/O-bound (waiting on HTTP requests).
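For illustration, here's a minimal sketch of pushing the CPU-bound parsing step into a process pool. extract_links and the pages list are hypothetical names; in practice you'd feed it the HTML your fetching threads collect:

from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def extract_links(html):
    # CPU-bound parsing runs in a separate process, sidestepping the GIL.
    soup = BeautifulSoup(html, "lxml")
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    pages = []  # HTML strings collected by your fetching threads
    with ProcessPoolExecutor() as pool:
        link_lists = list(pool.map(extract_links, pages))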
Here's some example code:
import queue
import requests
import threading
import time
from bs4 import BeautifulSoup


def search_links(q, urls, seen):
    while True:
        # Blocks until an item is available; workers exit via the level
        # check below rather than on an empty queue.
        url, level = q.get()

        if level <= 0:
            break

        try:
            soup = BeautifulSoup(requests.get(url).text, "lxml")

            for x in soup.find_all("a", href=True):
                link = x["href"]

                # Naive join for relative links; urllib.parse.urljoin
                # would be more robust.
                if link and link[0] in "#/":
                    link = url + link[1:]

                if link not in seen:
                    seen.add(link)
                    urls.append(link)
                    q.put((link, level - 1))
        except (requests.exceptions.InvalidSchema,
                requests.exceptions.ConnectionError):
            pass


if __name__ == "__main__":
    levels = 2
    workers = 10
    start_url = "https://masdemx.com/category/creatividad/?fbclid=IwAR0G2AQa7QUzI-fsgRn3VOl5oejXKlC_JlfvUGBJf9xjQ4gcBsyHinYiOt8"
    seen = set()
    urls = []
    threads = []
    q = queue.Queue()
    q.put((start_url, levels))
    start = time.time()

    for _ in range(workers):
        t = threading.Thread(target=search_links, args=(q, urls, seen))
        t.daemon = True
        threads.append(t)
        t.start()

    for thread in threads:
        thread.join()

    print(f"Found {len(urls)} URLs using {workers} workers "
          f"{levels} levels deep in {time.time() - start}s")
Here are a few sample runs on my not-especially-fast machine:
$ python thread_req.py
Found 762 URLs using 15 workers 2 levels deep in 33.625585317611694s
$ python thread_req.py
Found 762 URLs using 10 workers 2 levels deep in 42.211519956588745s
$ python thread_req.py
Found 762 URLs using 1 workers 2 levels deep in 105.16120409965515s
That's a 3x performance boost on this small run. I ran into maximum request errors on larger runs, so this is just a toy example.
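If you do scale it up, one option (not used in the timings above) is to share a requests.Session (or use one per worker) configured with retries, backoff, and a timeout; a minimal sketch:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff instead of giving up.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# In the worker, replace requests.get(url) with session.get(url, timeout=10).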