I recently coded a program that finds the shortest path between two Wikipedia articles. The problem is that getting ALL the links from a page and putting them into a graph takes a long time; finding the path is the easy part. Basically, what I'm doing is this:
import networkx as nx
import pages  # my own module that contains import_page (shown below)

startingPage = 'Lisbon'
target = 'Adolf Hitler'

graph = nx.DiGraph()
graph.add_node(startingPage)

found = pages.import_page(graph, startingPage, target)
while found != True:
    for node in list(graph):
        if graph.out_degree(node) == 0:  # only expand pages whose links haven't been fetched yet
            found = pages.import_page(graph, node, target)
            if found == True:
                break
And my import_page function is this one:
import json
import urllib2 as url
from urllib import quote

def import_page(graph, starting, target):
    general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
    data_str = general_str + quote(starting.encode('utf-8'))  # URL-encode the title (spaces, accents, etc.)
    response = url.urlopen(data_str)
    data = json.loads(response.read())
    pageId = list(data['query']['pages'].keys())
    print starting
    if pageId[0] == '-1':  # Check if the page doesn't exist in Wikipedia
        return False
    elif 'links' not in data['query']['pages'][pageId[0]]:  # Check if the page has no links in it
        return False
    for jsonObject in data['query']['pages'][pageId[0]]['links']:
        graph.add_node(jsonObject['title'])
        graph.add_edge(starting, jsonObject['title'])
        if jsonObject['title'] == target:
            return True
    while 'batchcomplete' not in data:  # keep requesting until the link list for this page is complete
        continueId = data['continue']['plcontinue']
        continue_str = data_str + '&plcontinue=' + quote(continueId.encode('utf-8'))
        response = url.urlopen(continue_str)
        data = json.loads(response.read())
        for jsonObject in data['query']['pages'][pageId[0]]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True
    return False
The problem is that for any distance bigger than 2 or 3 links it takes an immense amount of time. Any ideas on how I can speed it up?
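One idea I've been considering is batching the requests: as far as I can tell, the API accepts several titles in a single call if they are separated by | (up to around 50 per request, I believe), so the unexpanded pages could be fetched in chunks instead of one HTTP round trip per page. Below is a rough, untested sketch of what I mean; import_pages_batch is a hypothetical helper, it ignores plcontinue continuation for brevity, and plnamespace=0 is there to skip Talk/File/etc. links:

import json
import urllib2 as url
from urllib import quote

API = ('https://en.wikipedia.org/w/api.php?action=query&prop=links'
       '&pllimit=max&plnamespace=0&format=json&titles=')

def import_pages_batch(graph, titles, target):
    # Fetch the outgoing links of a whole batch of pages with a single request.
    joined = '|'.join(quote(t.encode('utf-8'), safe='') for t in titles)
    data = json.loads(url.urlopen(API + joined).read())
    for page in data['query']['pages'].values():
        source = page.get('title')
        for link in page.get('links', []):  # missing pages simply have no 'links' key
            graph.add_node(link['title'])
            graph.add_edge(source, link['title'])
            if link['title'] == target:
                return True
    return False

If that direction makes sense, the main loop would collect every node with out_degree 0 into a list and expand it in chunks, rather than calling import_page once per node.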