3

Is there a (possibly directional) notion or implementation of distance between Wikipedia categories/pages?

For example, consider: A) "Saint Louis University" B) "university"

Clearly "A" is a type of "B". How can you extract this from Wikipedia? If you extract all the categories connected to A, you get:

Category:1818 establishments in Missouri Territory 
Category:Articles containing Latin-language text 
Category:Association of Catholic Colleges and Universities
Category:Commons category with local link same as on Wikidata
Category:Coordinates on Wikidata 
Category:Educational institutions established in 1818
Category:Instances of Infobox university using image size
Category:Jesuit universities and colleges in the United States
Category:Roman Catholic Archdiocese of St. Louis
Category:Roman Catholic universities and colleges in Missouri

and none of these connects directly to B (https://en.wikipedia.org/wiki/University). But if you look further, you should be able to find a multi-hop path between A and B. What are the popular ways of accomplishing this?

Daniel
  • you can consider looking into my project on extracting Wikipedia category hierarchy - https://github.com/wasiahmad/Mining-Wikipedia/tree/master/WikiNomy – Wasi Ahmad Dec 26 '16 at 05:14
  • @WasiAhmad How does your project differ from accessing Wiki information via MediaWiki api? – Daniel Dec 26 '16 at 05:15
  • My project doesn't use any API; it extracts the category hierarchy directly from the Wiki dump. I needed the entire Wiki category hierarchy for one of my research projects, so I developed that project. – Wasi Ahmad Dec 26 '16 at 05:20
  • DBPedia http://dbpedia.org ? – alvas Dec 27 '16 at 08:10

3 Answers

1

If you have the entire Wikipedia category taxonomy, then you can compute the distance (shortest path length) between two categories. If one category is an ancestor of the other, this is straightforward.

Otherwise, you can find the least common subsumer (LCS), which is defined as follows.

The least common subsumer of two concepts A and B is the most specific concept that is an ancestor of both A and B.

Then compute the distance between them via the LCS: the path length from A up to the LCS plus the path length from B up to the LCS.
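As a rough illustration of the idea, here is a minimal Python sketch. It assumes the hierarchy is available as a child-to-parents map; the toy `PARENTS` data and the helper names `ancestor_distances` and `lcs_distance` are made up for this example and do not come from any library. It also picks the common ancestor that minimises the total number of hops, which is one common way of choosing the "most specific" subsumer, not the only one.

```python
from collections import deque

# Toy child -> parents map; the category names are illustrative, not real Wikipedia data.
PARENTS = {
    "Saint Louis University": ["Jesuit universities", "Universities in Missouri"],
    "Jesuit universities": ["Catholic universities"],
    "Catholic universities": ["Universities"],
    "Universities in Missouri": ["Universities"],
    "Harvard University": ["Universities in Massachusetts"],
    "Universities in Massachusetts": ["Universities"],
    "Universities": ["Educational institutions"],
}

def ancestor_distances(node, parents):
    """BFS upward: map the node and each of its ancestors to the minimum hop count."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in dist:  # first visit is the shortest; also guards against cycles
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def lcs_distance(a, b, parents):
    """Return (subsumer, distance) minimising hops from a plus hops from b, or None."""
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    common = set(da) & set(db)
    if not common:
        return None
    # Ties are broken by name so the result is deterministic.
    best = min(common, key=lambda c: (da[c] + db[c], c))
    return best, da[best] + db[best]

result = lcs_distance("Saint Louis University", "Harvard University", PARENTS)
print(result)
```

Because `ancestor_distances` only walks upward, it also gives a directional measure: "Universities" appears among the ancestors of "Saint Louis University", but not the other way round.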

I encourage you to go through similarity measures, where you will find state-of-the-art techniques for computing semantic similarity between words.

Resource: My project on extracting Wikipedia category/concept might help you.

One very good related example

Computing semantic similarity between words using WordNet. WordNet organizes English words in a hierarchical fashion. See this WordNet similarity for Java demo. It uses eight different state-of-the-art techniques to compute semantic similarity between words.

Wasi Ahmad
1

Some ideas/resources I collected. Will update this if I find more.

-- Using DBPedia: a knowledge base curated from Wikipedia. They provide a SPARQL endpoint to query this KB, but one has to simulate the desired similarity/distance behavior via their SPARQL interface. Some ideas are here and here, but they seem to be outdated.

-- Using UMBEL: http://umbel.org/, a knowledge graph of concepts. I think this knowledge graph is relatively small, but I suspect its precision is high. That said, I'm not sure how it relates to Wikipedia, if at all. They have an API for calculating a distance measure between any pair of their concepts (at the time of writing, their similarity API is down, so this is not a feasible solution at the moment).

-- Using http://degreesofwikipedia.com/: I don't know the details of their algorithm, but they provide a distance between Wiki concepts, and it is directional. For example, this and this.

Community
Daniel
1

You might be looking for the "is a" relationship: Q734774 (the Wikidata item for Saint Louis University) is a university, a building and a private not-for-profit educational institution. You can use SPARQL to query it:
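A minimal sketch of what such a query might look like, built as a string in Python. It assumes the Wikidata properties P31 ("instance of") and P279 ("subclass of") and the item Q3918 for "university"; the helper name `build_is_a_query` is made up for this illustration, and this is not necessarily the query the answer originally linked to.

```python
def build_is_a_query(item_id, class_id):
    """Build a SPARQL ASK query: is the item an instance of the class or of one of its subclasses?"""
    # wdt:P31 = "instance of", wdt:P279 = "subclass of" (assumed property IDs).
    # The * in the property path allows zero or more subclass hops.
    return (
        "ASK { "
        f"wd:{item_id} wdt:P31/wdt:P279* wd:{class_id} "
        "}"
    )

# Q734774 is the item named in the answer; Q3918 (university) is an assumption.
query = build_is_a_query("Q734774", "Q3918")
print(query)
```

The printed query can be pasted into the Wikidata Query Service, where the `wd:` and `wdt:` prefixes are predefined.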

Tgr
  • This is very nice @Tgr! Could you also write an equivalent form of the first query using DBPedia? – Daniel Dec 27 '16 at 00:16
  • I'm not familiar with DBPedia. My general impression was that they have more data but it's flatter (mostly infobox parameter-value pairs) so it's less suitable for queries like this... might be completely wrong about that though. – Tgr Dec 27 '16 at 01:22
  • I see, thanks @Tgr. Another question: how can I print the results of the `ASK` query (for the first link)? I'd like to see the path that connects the two. – Daniel Dec 28 '16 at 05:11
  • The second query does that. If you remove the `count()`, it will show the path segments instead. – Tgr Dec 28 '16 at 07:13