I recently coded a program that finds the shortest path between two Wikipedia articles. The problem is that getting ALL the links from a page and putting them into a graph takes a long time; finding the path is the easy part. Basically, what I'm doing is this:
import networkx as nx
import pages  # my own module that contains import_page (shown below)

startingPage = 'Lisbon'
target = 'Adolf Hitler'

graph = nx.DiGraph()
graph.add_node(startingPage)

found = pages.import_page(graph, startingPage, target)
while found != True:
    for node in list(graph):
        if graph.out_degree(node) == 0:  # only expand pages whose links haven't been fetched yet
            found = pages.import_page(graph, node, target)
            if found == True:
                break
And my import_page function is this one:
import json
import urllib2 as url
from urllib import quote

def import_page(graph, starting, target):
    general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
    data_str = general_str + quote(starting.encode('utf-8'))  # URL-encode the title (spaces, accents, etc.)
    response = url.urlopen(data_str)
    data = json.loads(response.read())
    pageId = list(data['query']['pages'].keys())
    print starting
    if pageId[0] == '-1':  # Check if the page doesn't exist in Wikipedia
        return False
    elif 'links' not in data['query']['pages'][pageId[0]]:  # Check if the page has no links in it
        return False
    for jsonObject in data['query']['pages'][pageId[0]]['links']:
        graph.add_node(jsonObject['title'])
        graph.add_edge(starting, jsonObject['title'])
        if jsonObject['title'] == target:
            return True
    while 'batchcomplete' not in data:  # keep requesting until the link list for this page is complete
        continueId = data['continue']['plcontinue']
        continue_str = data_str + '&plcontinue=' + quote(continueId.encode('utf-8'))
        response = url.urlopen(continue_str)
        data = json.loads(response.read())
        for jsonObject in data['query']['pages'][pageId[0]]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True
    return False
The problem is that for any distance bigger than 2 or 3 links it takes an immense amount of time. Any ideas on how I can speed it up?
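One idea I've been considering is batching the requests: as far as I can tell, the API accepts several titles in a single call if they are separated by | (up to around 50 per request, I believe), so the unexpanded pages could be fetched in chunks instead of one HTTP round trip per page. Below is a rough, untested sketch of what I mean; import_pages_batch is a hypothetical helper, it ignores plcontinue continuation for brevity, and plnamespace=0 is there to skip Talk/File/etc. links:

import json
import urllib2 as url
from urllib import quote

API = ('https://en.wikipedia.org/w/api.php?action=query&prop=links'
       '&pllimit=max&plnamespace=0&format=json&titles=')

def import_pages_batch(graph, titles, target):
    # Fetch the outgoing links of a whole batch of pages with a single request.
    joined = '|'.join(quote(t.encode('utf-8'), safe='') for t in titles)
    data = json.loads(url.urlopen(API + joined).read())
    for page in data['query']['pages'].values():
        source = page.get('title')
        for link in page.get('links', []):  # missing pages simply have no 'links' key
            graph.add_node(link['title'])
            graph.add_edge(source, link['title'])
            if link['title'] == target:
                return True
    return False

If that direction makes sense, the main loop would collect every node with out_degree 0 into a list and expand it in chunks, rather than calling import_page once per node.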