For starters, note that this could be a big operation. For example, if each page has an average of only 10 unique links, you're looking at over 10 million requests if you want to recurse 7 layers deep.
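To put a rough number on that, here's the back-of-the-envelope arithmetic (a crude upper bound that ignores duplicate links):

links_per_page = 10
depth = 7
# Level 0 is the starting page; each level multiplies the page count by 10.
total_requests = sum(links_per_page ** level for level in range(depth + 1))
print(total_requests)  # 11111111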
Also, I'd use an HTML parsing library like BeautifulSoup instead of regex, which is a brittle way to scrape HTML. I'd also avoid printing to stdout, which slows things down.
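For reference, link extraction with BeautifulSoup is only a couple of lines. Here's a minimal sketch (the URL is a placeholder, and it assumes the lxml parser is installed):

import requests
from bs4 import BeautifulSoup

# Parse anchor tags instead of regexing the raw HTML; the URL is a placeholder.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "lxml")
links = [a["href"] for a in soup.find_all("a", href=True)]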
As for threading, one approach is to use a work queue. Python's queue.Queue class is thread safe, so you can create a pool of worker threads that poll it for URLs to retrieve. When a thread gets a URL, it finds all the links on the page and appends the relevant URLs (or page data, if you wish) to a global list, which is a thread-safe operation on CPython (for other implementations, use a lock on shared data structures). The new URLs are then enqueued on the work queue and the process continues.
Threads exit when they pull an item whose level has reached 0. Since we're using a queue (BFS) rather than a stack (DFS), shallower levels are fully processed before deeper ones. The (probably safe) assumption here is that there are more levels of links than the crawl depth, and enough links at the deepest level for every worker to receive one; otherwise, some threads would block forever on an empty queue.
The parallelism comes from threads blocking while they wait on request responses, which frees the CPU to run a different thread whose response has already arrived so it can parse the HTML and enqueue new work.
If you'd like to run on multiple cores to help parallelize the CPU-bound portion of the workload, read about the GIL and look into spawning processes (for example, with the multiprocessing module). But threading alone gets you much of the way, since the bottleneck is I/O-bound (waiting on HTTP requests).
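For illustration, here's a minimal sketch of pushing the CPU-bound parsing step into a process pool. extract_links and the pages list are hypothetical names; in practice you'd feed it the HTML your fetching threads collect:

from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def extract_links(html):
    # CPU-bound parsing runs in a separate process, sidestepping the GIL.
    soup = BeautifulSoup(html, "lxml")
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    pages = []  # HTML strings collected by your fetching threads
    with ProcessPoolExecutor() as pool:
        link_lists = list(pool.map(extract_links, pages))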
Here's some example code:
import queue
import requests
import threading
import time
from bs4 import BeautifulSoup


def search_links(q, urls, seen):
    while True:
        # Blocks until an item is available; workers exit via the level
        # check below rather than on an empty queue.
        url, level = q.get()

        if level <= 0:
            break

        try:
            soup = BeautifulSoup(requests.get(url).text, "lxml")

            for x in soup.find_all("a", href=True):
                link = x["href"]

                # Naive join for relative links; urllib.parse.urljoin
                # would be more robust.
                if link and link[0] in "#/":
                    link = url + link[1:]

                if link not in seen:
                    seen.add(link)
                    urls.append(link)
                    q.put((link, level - 1))
        except (requests.exceptions.InvalidSchema,
                requests.exceptions.ConnectionError):
            pass


if __name__ == "__main__":
    levels = 2
    workers = 10
    start_url = "https://masdemx.com/category/creatividad/?fbclid=IwAR0G2AQa7QUzI-fsgRn3VOl5oejXKlC_JlfvUGBJf9xjQ4gcBsyHinYiOt8"
    seen = set()
    urls = []
    threads = []
    q = queue.Queue()
    q.put((start_url, levels))
    start = time.time()

    for _ in range(workers):
        t = threading.Thread(target=search_links, args=(q, urls, seen))
        t.daemon = True
        threads.append(t)
        t.start()

    for thread in threads:
        thread.join()

    print(f"Found {len(urls)} URLs using {workers} workers "
          f"{levels} levels deep in {time.time() - start}s")
Here are a few sample runs on my not-especially-fast machine:
$ python thread_req.py
Found 762 URLs using 15 workers 2 levels deep in 33.625585317611694s
$ python thread_req.py
Found 762 URLs using 10 workers 2 levels deep in 42.211519956588745s
$ python thread_req.py
Found 762 URLs using 1 workers 2 levels deep in 105.16120409965515s
That's a 3x performance boost on this small run. I ran into maximum request errors on larger runs, so this is just a toy example.
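If you do scale it up, one option (not used in the timings above) is to share a requests.Session (or use one per worker) configured with retries, backoff, and a timeout; a minimal sketch:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff instead of giving up.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# In the worker, replace requests.get(url) with session.get(url, timeout=10).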