
I am trying to perform a BFS on Wikipedia pages. I believe I am implementing this correctly and in the best possible way run-time-wise (keeping it to one thread), but it is taking quite a long time to find a connection between two articles. Here is my implementation:

from collections import deque
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://en.wikipedia.org'  # assumed base URL for the relative /wiki/ links

marked = set()
queue = deque()
count = 0

def get_wikipedia_page(wiki_link):
    url = BASE_URL + wiki_link
    time.sleep(0.1)
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

def backtrace(parent, start, end):
    boolean = False
    ret = []
    while boolean == False:
        if end in parent:
            ret.append(parent[end])
            end = parent[end]
            if end == start:
                break;
    ret.reverse()
    return ret

def bfs(first_wiki_link, second_wiki_link):
    print('Searching starting from ' + first_wiki_link + ' looking for ' + second_wiki_link + '...')
    parent = {}
    queue.append(first_wiki_link)
    start_time = time.time()
    #current_parent = first_wiki_link
    while queue:
        link = queue.pop()
        current_parent = link
        link_list = list(filter(lambda c: c.get('href') != None and c.get('href').startswith('/wiki/') and c.get('href') != '/wiki/Main_Page' and ':' not in c.get('href'), get_wikipedia_page(link).findAll('a')))
        for link in link_list:
            href = link.get('href')
            if not marked.__contains__(href):
                parent[href] = current_parent
                marked.add(href)
                queue.append(href)
            if href == second_wiki_link:
                ret = backtrace(parent, first_wiki_link, second_wiki_link)
                print("connection found")
                end_time = time.time()
                print("found a connection in: " + str(end_time - start_time) + " seconds.")
                print("the path is " + str(len(ret)) + " pages long")
                print(ret)
                return True

It sometimes takes a few minutes to find a match. Is this expected given how big Wikipedia is? Or am I messing something up and performing the search in a suboptimal way? Or could Beautiful Soup be running too slowly?

  • that `time.sleep(0.1)` might be the cause – Photon Dec 10 '21 at 15:59
  • I highly recommend running your program with [line_profiler](https://github.com/pyutils/line_profiler), it's my favourite Python profiling tool – Ruben Helsloot Dec 10 '21 at 16:00
  • Thanks, I missed that. I was messing around with time delays to see how much longer it would take. I just ran it from Chicago to Painting - with the 0.1 it took 236 seconds, and when I removed it, it was 161. So, a great deal faster, but still not as quick as I expected. – s_jack_frost Dec 10 '21 at 16:05
  • @RubenHelsloot thanks for the link, I will do that. – s_jack_frost Dec 10 '21 at 16:06
  • Python will always be slow compared to a compiled language. How many nodes and links do you have? A C++ search through 400,000 nodes and 3.3M links visits every node in 0.5 secs. https://github.com/JamesBremner/PathFinder/wiki/Performance – ravenspoint Dec 10 '21 at 16:17
  • Loading multiple webpages is very slow in Python because each request blocks your script while it loads the data. You need asynchronous programming (e.g. `asyncio`) to load the pages concurrently; a minimal sketch follows these comments. See https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests – Tom McLean Dec 10 '21 at 16:21
  • @ravenspoint It is scraping all the wiki links from a page and each of those are nodes - so depending on the articles it passes I think it is touching around 100k+ nodes and that usually takes 5 or so minutes. – s_jack_frost Dec 10 '21 at 16:25
  • @TomMcLean I was trying not to do this asynchronously as I do not want to strain Wikipedia's servers... unless you are talking only about making the BeautifulSoup bit asynchronous? Is that possible? – s_jack_frost Dec 10 '21 at 16:26
  • You can download the whole of Wikipedia and run your program against that - https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get_it? – Tom McLean Dec 10 '21 at 16:31
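Following up on the asynchronous-loading suggestion above, here is a minimal sketch of fetching a batch of pages concurrently. It assumes the third-party `aiohttp` library, which the original code does not use, and hypothetical example URLs:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # one GET request; awaiting here lets the other requests proceed in the meantime
        async with session.get(url) as resp:
            return await resp.text()

    async def fetch_all(urls):
        # share one connection pool and issue all requests concurrently
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, url) for url in urls))

    # hypothetical usage: fetch a whole frontier of /wiki/ links in one go
    urls = ['https://en.wikipedia.org' + path for path in ['/wiki/Chicago', '/wiki/Painting']]
    pages = asyncio.run(fetch_all(urls))

Only the network waits overlap; the BeautifulSoup parsing still runs synchronously per page, and (as noted above) how many requests you fire at once is a politeness question for Wikipedia's servers.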

1 Answer


You are effectively doing a DFS, not a BFS. In a graph as large as Wikipedia, a DFS can take a very long time to get around to the close neighbors of a node, and it will most likely take longer to find a connection, because chances are that the two pages are somehow related and therefore close to each other.

This part of the code:

    while queue:
        link = queue.pop() # <---- look at this
        current_parent = link
        link_list = list(filter...
        for link in link_list:

This makes it a DFS because you are using the queue like a stack. A BFS needs a FIFO (first in, first out) data structure, while you are doing LIFO (last in, first out), which is what a DFS does.

In the worst case, this means the first neighbor of the start page is only looked at after you have crawled through practically every other page on Wikipedia.
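You can see the difference with a plain `collections.deque`:

    from collections import deque

    d = deque([1, 2, 3])
    print(d.pop())      # 3: removed from the right end (LIFO, stack behavior)
    print(d.popleft())  # 1: removed from the left end (FIFO, queue behavior)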

To solve it:

    while queue:
        link = queue.popleft() # <---- look at this
        current_parent = link
        link_list = list(filter...
        for link in link_list:

This looks at the closest neighbors first and expands outward breadth-first, like you intend it to do.

user1984
  • Wow, thanks a ton, this reduced my search from 'Chicago' to 'Painting' from 161 seconds down to 56, and found a path in 2 steps instead of 503. I don't understand how I was using the queue like a stack, though... can you please explain further? I thought I was adding all of the parent's neighbors before going into any of its children? – s_jack_frost Dec 10 '21 at 16:56
  • Actually, I think I understand. The `pop` function takes elements from the back instead of the front, while `popleft` takes them from the front? – s_jack_frost Dec 10 '21 at 17:01