I'm making a Wikipedia crawler, but it's very slow. How can I make it faster?

I'm using the requests module and beautifulsoup4 to parse the HTML pages. I've tried implementing multithreading, but it's still slow.

import sys

import requests
from bs4 import BeautifulSoup as bs
from queue import Queue

baseURL = "https://en.wikipedia.org"

startURL = "/wiki/French_battleship_Courbet_(1911)"
endURL = "/wiki/Royal_Navy"

tovisit = Queue()  # frontier of URLs still to crawl
visited = []       # pages already crawled

def main():
    if (not checkValid(startURL)) or (not checkValid(endURL)):
        print("Invalid URLs entered.")
        sys.exit()

    initCrawler(startURL)

def initCrawler(startURL):

    # Queue.put and list.append mutate in place, so no global declarations needed
    tovisit.put(startURL)

    # keep crawling until the frontier is empty
    while not tovisit.empty():

        url = tovisit.get()

        childlinks = linkCrawl(url)

        for i in childlinks:
            tovisit.put(i)

        visited.append(url)

def linkCrawl(url):

    # visited, tovisit, and endURL are only read in this function,
    # so no global declarations are needed
    print("crawling " + url)

    # fetch and parse the page, then collect every anchor that has an href
    r = requests.get(baseURL + url)
    soup = bs(r.content, "html.parser")

    rawlinks = soup.find_all('a', href=True)

    refinedlinks = []

    for rawLink in rawlinks:
        i = rawLink["href"]
        # ensure what we have is a string (also rejects None)
        if not isinstance(i, str):
            continue
        # no point revisiting pages we've already seen or queued
        if i in visited:
            continue
        if i in list(tovisit.queue):
            continue
        if not checkValid(i):
            continue
        if i == endURL:
            print("found " + endURL)
            sys.exit()
        refinedlinks.append(i)

    return refinedlinks

def checkValid(url):
    # only follow internal article links and skip special namespaces
    if not url.startswith("/wiki/"):
        return False
    if url.startswith(("/wiki/Special:", "/wiki/Wikipedia:",
                       "/wiki/Portal:", "/wiki/File:")):
        return False
    if url.endswith("(disambiguation)"):
        return False
    return True

if __name__ == "__main__":
    main()

I expected the bot to run faster, but it's still slow. From what I've read, multithreading eventually won't be enough on its own.
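
To show what I mean by multithreading, here is a simplified sketch of the thread-pool pattern (not my exact code; it reuses linkCrawl from above and tracks seen URLs in a set so membership checks stay cheap):

# Simplified thread-pool sketch (illustrative): each round, every URL
# in the current frontier is fetched in parallel by worker threads.
from concurrent.futures import ThreadPoolExecutor

def crawl_threaded(start_url, max_workers=8):
    frontier = [start_url]      # URLs to fetch this round
    seen = {start_url}          # set: cheap membership tests
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # linkCrawl (defined above) fetches and parses one page
            results = pool.map(linkCrawl, frontier)
            frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)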

Ravi Ghaghada
  • Welcome to Stack Overflow! Why is visited a list? If you used a dictionary (or a set), I bet that would speed things up a lot; a list is NOT a good data structure for lookups (see the first sketch after these comments). Try reading about Python data structures a bit more before writing a web crawler. You're also right that even multithreading might not be enough at some point, and you might need more than one machine for a serious crawler, but for starters what you have should be enough. https://stackoverflow.com/questions/102631/how-to-write-a-crawler – darxsys Apr 16 '19 at 18:03
  • You can use Scrapy. Scrapy is a fast, open-source web-crawling framework written in Python that extracts data from pages with XPath/CSS selectors, and it handles concurrency and load balancing for you (see the spider sketch below). https://scrapy.org/ – Shariful Islam Apr 16 '19 at 18:12
  • Try using [asyncio](https://docs.python.org/3/library/asyncio.html) + [aiohttp](https://aiohttp.readthedocs.io/en/stable/); here is a crawler [example](https://github.com/aio-libs/aiohttp/blob/master/examples/legacy/crawl.py), and see the sketch below. – Kamrus Apr 16 '19 at 18:43
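
A minimal sketch of the set-based lookup darxsys suggests (names are illustrative; a set membership test is an average O(1) hash lookup, versus an O(n) scan for a list):

# Checking "x in visited" on a list scans every element (O(n));
# on a set it is a hash lookup (O(1) on average). The crawler tests
# every extracted link against visited, so the difference compounds.
visited = set()

def mark_visited(url):
    visited.add(url)

def already_seen(url):
    return url in visited  # O(1) average, vs O(n) for a list

mark_visited("/wiki/Royal_Navy")
print(already_seen("/wiki/Royal_Navy"))  # True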
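
A minimal Scrapy spider along the lines of the Scrapy suggestion (a sketch; assumes pip install scrapy, and the spider name and selector are illustrative). Scrapy schedules requests concurrently and deduplicates URLs out of the box; run this with scrapy runspider wiki_spider.py:

import scrapy

class WikiSpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/French_battleship_Courbet_(1911)"]

    def parse(self, response):
        # follow every internal /wiki/ link; Scrapy's built-in
        # dupefilter skips URLs it has already scheduled
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse)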
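
And a small asyncio + aiohttp sketch of the suggestion above (assumes pip install aiohttp; the fetch helper is illustrative). A single event loop keeps many requests in flight at once instead of one thread per request:

import asyncio
import aiohttp

BASE = "https://en.wikipedia.org"

async def fetch(session, path):
    # one in-flight GET; awaiting frees the loop to run other fetches
    async with session.get(BASE + path) as resp:
        return await resp.text()

async def main(paths):
    async with aiohttp.ClientSession() as session:
        # gather schedules all fetches concurrently on the event loop
        pages = await asyncio.gather(*(fetch(session, p) for p in paths))
        for path, html in zip(paths, pages):
            print(path, len(html))

asyncio.run(main(["/wiki/French_battleship_Courbet_(1911)", "/wiki/Royal_Navy"]))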

0 Answers