
The question:

Find the shortest path between two articles in the English Wikipedia. A path between article A and article B exists if there are articles C(i) such that article A contains a link that leads to C(1), C(1) contains a link that leads to C(2), ..., and C(n) contains a link that leads to B.

I'm using Python. URL to download wikipedia article:

  1. http://en.wikipedia.org/wiki/Nazwa_artykułu
  2. http://en.wikipedia.org/w/index.php?title=Nazwa_artykułu&printable=yes
  3. Wikipedia API

I have edited my source code, but it still does not work when I include those articles in the code. Can anyone tell me what I am missing here?

This is my code:

import requests
import xml.etree.ElementTree as ET

# Download the two articles and pull the links out of the body text.
# Note: ElementTree expects well-formed XML, so parsing raw Wikipedia HTML this
# way is fragile; an HTML parser (lxml, BeautifulSoup) would be more robust.
source_page = requests.get('http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms')
target_page = requests.get('http://en.wikipedia.org/wiki/Wikipedia:Unusual_articles')
source_tree = ET.fromstring(source_page.text.encode('UTF-8'))
target_tree = ET.fromstring(target_page.text.encode('UTF-8'))
links = source_tree.findall('.//*[@id="mw-content-text"]/p/a')


def D_Algorithm(graph, start, end):
    # Dijkstra's algorithm. graph is a dict of dicts: graph[u][v] = cost of edge u -> v.
    E_D = {start: 0}   # estimated (tentative) distances
    F_D = {}           # final (settled) distances
    P = {}             # predecessor of each vertex on the shortest path

    while E_D:
        # settle the pending vertex with the smallest tentative distance
        vertex = min(E_D, key=E_D.get)
        F_D[vertex] = E_D.pop(vertex)
        if vertex == end:
            break

        for edge in graph[vertex]:
            if edge in F_D:
                continue   # already settled
            path_distance = F_D[vertex] + graph[vertex][edge]
            if edge not in E_D or path_distance < E_D[edge]:
                E_D[edge] = path_distance
                P[edge] = vertex
    return (F_D, P)


def Shortest_Path(graph, start, end):
    F_D, P = D_Algorithm(graph, start, end)
    path = []
    while 1:
        path.append(end)
        if end == start:
            break
        end = P[end]
    path.reverse()
    return path
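
As a quick sanity check, the two functions above can be exercised on a small made-up graph (a dict of dicts mapping each node to its neighbours and edge costs; for a Wikipedia link graph every edge would simply have cost 1):

# A tiny hypothetical graph just for testing the functions above.
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'C': 1, 'D': 5},
    'C': {'D': 1},
    'D': {},
}
print(Shortest_Path(graph, 'A', 'D'))   # expected: ['A', 'B', 'C', 'D']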
Jefferson X Masonic
  • I so want to know why you are doing this? :) – Toby Allen Apr 13 '13 at 18:23
  • Toby, I am learning Python; I want to do as many exercises as I can. If you can help, thanks; if you can't, also thanks, and enjoy the weekend ;) – Jefferson X Masonic Apr 13 '13 at 19:04
  • Removed the "Windows" tag since I didn't see anything windows-specific in the question. Put back if that's an error on my part. – angelatlarge Apr 13 '13 at 19:43
  • It's okay, no problem. – Jefferson X Masonic Apr 13 '13 at 19:56
  • You could try using "what links here" on Wikipedia to find all incoming links to a page. That would make the problem much easier to solve. – Anderson Green Apr 13 '13 at 20:23
  • As @AndersonGreen suggested, use the inbound links and do a simple BFS. Unless you have a weighting function for the edges (which I don't see and can't imagine you would), you can consider all edges as having equal weight, which makes the shortest paths exactly those that discover the target at the same BFS-distance from the source (a sketch of this follows these comments). – G. Bach Apr 13 '13 at 20:47
  • Check out http://mu.netsoc.ie/wiki/ for somebody who has done this previously. It's from March 2008 so the data is out of date, and written in Perl and C++ from what I can tell, but the description of how he solved it might be useful. – Blair Apr 13 '13 at 22:14
  • If you don't want to take the distributed computing approach of the guy @Blair mentioned, you should look into the Floyd–Warshall algorithm because it solves the problem once and for all for any two given articles (something one would probably want in a real application). However, it will need a data structure that'll be terabytes in size for the current English Wikipedia. It will also need ages to create that data structure. Lookup is lightning fast though. – Christian Sep 13 '15 at 12:37
  • @Toby Allen - For another approach in knowledge discovery, see also - fetching connections between two or more articles in small-world Wikipedia network: http://stackoverflow.com/a/40965046/305883 – user305883 Dec 05 '16 at 02:57
  • Would be nice if you have something working @JeffersonXMasonic – Daniel Dec 26 '16 at 01:29
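
Following the BFS suggestion in the comments above, here is a minimal sketch of that approach (not the asker's code): it walks outgoing article links through the MediaWiki API's prop=links query, ignores result pagination so only the first batch of links per page is followed, and the helper names are just for illustration; the inbound-link variant mentioned above would use list=backlinks instead.

import requests
from collections import deque

API = 'https://en.wikipedia.org/w/api.php'

def get_links(title):
    # Outgoing article links of one page (first result batch only, namespace 0).
    params = {'action': 'query', 'titles': title, 'prop': 'links',
              'plnamespace': 0, 'pllimit': 'max', 'format': 'json'}
    data = requests.get(API, params=params).json()
    page = next(iter(data['query']['pages'].values()))
    return [link['title'] for link in page.get('links', [])]

def bfs_path(start, goal):
    # Unweighted shortest path: plain breadth-first search over article links.
    parent = {start: None}
    queue = deque([start])
    while queue:
        title = queue.popleft()
        if title == goal:
            path = []
            while title is not None:
                path.append(title)
                title = parent[title]
            return path[::-1]
        for link in get_links(title):
            if link not in parent:
                parent[link] = title
                queue.append(link)
    return None   # no path found in the crawled portion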

2 Answers


We are looking at graph exploration... why should you be considering Dijkstra's algorithm??? IMHO... change the approach.

First, you need a good heuristic function. For every node you expand, you need to guesstimate the distance of that node from the target/goal node. Now... how you compute the heuristic is the real challenge here. You may perhaps do a keyword mapping between the current wiki page and your destination page; the percentage of matching keywords may give you the estimate. Or... try to guess the relevance of the content of the two pages. I have a hunch... perhaps a neural network may help you here, but it may not give an optimal estimate either. I'm not sure. Once you figure out a suitable way of doing this, use the A* search algorithm.

Search and explore using the heuristic function; do not go for breadth-first search, or you'll end up nowhere in the vast wide world of Wikipedia!
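
A rough sketch of that idea follows. Everything in it is illustrative: get_links and get_keywords are hypothetical callables you would have to supply, and the keyword-overlap heuristic is only one possible guess and is not guaranteed admissible, so the path found may not be optimal.

import heapq

def keyword_overlap_heuristic(words, goal_words):
    # More shared keywords -> assume the page is "closer" to the goal.
    shared = len(words & goal_words)
    return 1.0 / (1 + shared)

def a_star(start, goal, get_links, get_keywords):
    # A* over the link graph; every link costs 1, h() is the heuristic above.
    goal_words = get_keywords(goal)
    frontier = [(0.0, 0, start, [start])]   # (f = g + h, g, title, path so far)
    seen = set()
    while frontier:
        f, g, title, path = heapq.heappop(frontier)
        if title == goal:
            return path
        if title in seen:
            continue
        seen.add(title)
        for link in get_links(title):
            if link not in seen:
                h = keyword_overlap_heuristic(get_keywords(link), goal_words)
                heapq.heappush(frontier, (g + 1 + h, g + 1, link, path + [link]))
    return None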

metsburg
  • and A* is just a little modification over Dijkstra's algo.... shouldn't be very difficult to modify your existing code into this. – metsburg Apr 16 '13 at 06:22
  • Still I can't figure out this problem, can someone help me? :(((( – Jefferson X Masonic Apr 23 '13 at 13:57
  • When you're computing path_distance = F_D[vertex] + graph[vertex][edge], you're taking the distance of the current vertex and adding the cost of the next edge... right? But suppose you have a heuristic estimate of the distance to the goal node (i.e. the degree of match between your current wiki page and the destination page), denoted by H(vertex); then your cost function becomes F_D[vertex] + graph[vertex][edge] + H(vertex). – metsburg Apr 24 '13 at 05:09
  • Insert these nodes, associated with their corresponding cost, into a binary min-heap (or just any sorted list). While expanding the next node, pop/fetch the smallest element of this list/heap and expand that, iteratively. – metsburg Apr 24 '13 at 05:09

Given the number of articles on Wikipedia, it would take an unaffordable amount of time to compute THE shortest path (my assumption - I haven't tried).

The real problem is to find an acceptable and efficient short path between two articles.

Algorithms that deal with this kind of problem are related to the travelling salesman problem. It could be a good point to start from.

IIRC Google or Yahoo bots use ant colony optimization to get an acceptable shortest path in a reasonable time. You could check this SO question: Where can I learn more about "ant colony" optimizations?

I'm personally also fond of the genetic algorithms approach to find an acceptable optimum in a certain amount of time.


I have just looked at that image and it puts the number of articles at 4,000,000 for en.wikipedia.com in 2013. Much less than I thought indeed.

EDIT: I first stated it was an NP-hard problem and commenters explained it's not.

Stephane Rolland
  • Why would this be NP-hard? – KillianDS Apr 13 '13 at 20:06
  • Yeah, this is just graph search (of course realistically you need to do this carefully so you don't run out of memory), which has a polynomial-time solution in the number of vertices and edges in the graph. What NP-hard reduction are you referring to? I don't think this is TSP actually, since that asks for the shortest path through every node, not the shortest path between two nodes. – Anil Vaitla Apr 13 '13 at 20:06
  • @tigger that's right. Re-reading the article on TSP and NP-hard. Our case is not NP-hard and not a case of TSP. But do you mean that something with complexity bigger than O(n), like graph walking, is a **possible** approach for browsing the Wikipedia graph? (I find myself confused with NP-hard and NP-complete.) – Stephane Rolland Apr 13 '13 at 20:24
  • @StephaneRolland NP-hard >= NP-complete. A problem X that is at least as hard as the most complex problem Y in NP is NP-hard; X does not have to be a member of NP. NP-complete means "NP-hard and in NP", so those are the most complex problems that are in NP. Also, I'm curious why you mention the number of articles on Wikipedia in the context of complexity. – G. Bach Apr 13 '13 at 20:46
  • @G.Bach Is the goal of complexity not to weigh the number of operations needed? – Stephane Rolland Apr 13 '13 at 20:52
  • @StephaneRolland Complexity is a means of classifying problems according to their "difficulty". It is not concerned with the size of the specific instances of the problem. For example, graph traversal has polynomial complexity, O(n+m) in Landau notation under common assumptions. It doesn't matter whether the graph has 2 or 10^100 vertices/edges, the complexity of the problem doesn't change (the time required for execution of course will). – G. Bach Apr 13 '13 at 20:54
  • @G. Bach oh... sorry. I thought polynomial was O(n²) and forbidden things like that... – Stephane Rolland Apr 13 '13 at 20:57
  • @G.Bach However I **explicitly** spoke about time of execution, so please, don't be completely surprised :-) – Stephane Rolland Apr 13 '13 at 21:00
  • O(n^2) is polynomial, but that doesn't make a problem that has a solution of that time complexity NP-hard (unless P = NP). In fact, any problem where we have a polynomial time solution means that problem cannot be NP-hard. Regarding the number of articles, I would have understood if you had mentioned it in the context of execution time, but you only did so after talking about complexity; I thought there might be a confusion there. – G. Bach Apr 13 '13 at 21:02
  • @G.Bach don't apologize, you state things clearly. I don't have that steady a knowledge of the domain. – Stephane Rolland Apr 13 '13 at 21:05
  • I understand, both edits seemed related. That's right. I have re-edited. – Stephane Rolland Apr 13 '13 at 21:07
  • Does it mean all I did here using Dijkstra's algorithm is for nothing? Because another algorithm will be tough. Please, what should I do in my code? – Jefferson X Masonic Apr 13 '13 at 22:52
  • Then you should choose your test pages really carefully. First take two articles that relate to one another. Then slowly increase the distance. But I do have the gut feeling that the number of links explodes within just a few levels... making it really hard for your algorithm to calculate the weights for all the paths. – Stephane Rolland Apr 14 '13 at 00:48
  • Is it me, or does your link http://en.wikipedia.org/wiki/Nazwa_artykułu not lead anywhere? – Stephane Rolland Apr 14 '13 at 00:49
  • I have edited my answer and added a link to http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms; this may be a good compromise for finding the shortest distance between points in a huge, huge graph. I think - I must recheck that point - that some search engine bots are/have been using this kind of algorithm to weight distances between all the URLs of the www. – Stephane Rolland Apr 15 '13 at 01:05
  • Sorry, that link is empty; you can add any link. Thanks for the reply. Can you post your full edited and working code here? Thanks in advance. – Jefferson X Masonic May 14 '13 at 22:00