computing distance between statements

Question

Is it possible to compute the distance between two statements, in SPARQL or Jena? For example, is it possible to compute the distance between:

Immanuel_Kant dbprop:birthPlace Germany
John_Locke    dbprop:birthPlace England

I want to calculate the distance (number of edges?) between those statements – user2837896, Oct 08 '13 at 13:41
I repeat my question to you. What is the "distance" between those two statements? I ask this because to my mind, your example doesn't make sense. – Stephen C, Oct 08 '13 at 13:44
I would calculate a similarity measure between two instances (in this case Kant and Locke) depending on ONLY a specific property, in this case birthPlace. This is my goal – user2837896, Oct 08 '13 at 13:48
It would be cool if it could abstract like so: Subject: Philosopher, object: Europe – Ingo, Oct 08 '13 at 13:50
This is very similar to [your previous question](http://stackoverflow.com/q/19133335/1281433) where you _accepted_ an answer that said that the distance between two statements can't be calculated just based on the statements themselves. The comments asked for more information about what you mean by distance, and you never clarified there, either. If you can explain the distance computation that you want to perform, and show what it should return for some particular instance (e.g., do it by hand for one example), then you _might_ be able to get some useful answers here. – Joshua Taylor, Oct 08 '13 at 14:14
In my example Locke and Kant are both philosophers: I calculated a taxonomy similarity between these instances and I obtained a number: 0.60. This similarity depends on the taxonomy and subclasses. NOW I WANT to refine this similarity: I want to compare Kant and Locke not only depending on the taxonomy but also depending on a specific property. In this case, for example, the similarity (0.60) that I found would decrease because Locke and Kant have different birthplaces. – user2837896, Oct 08 '13 at 14:39
If two philosophers had "Manchester" and "Liverpool" as birthplaces, the similarity would increase because the two cities are both English! I don't know if I was clear – user2837896, Oct 08 '13 at 14:40
You haven't told us how you "calculated a taxonomy similarity between these instances and [...] obtained a number: 0.60." I can claim that I computed it too and got .67, so you must have made a mistake. :) Nobody can tell until you tell us _how_ you're trying to compute these values. We can only help you fix things that aren't working right if we know _how_ they're supposed to work. Otherwise, I propose this fix: compute the value (e.g., .60) as you have been before for each pair of philosophers, and then find birthplaces. If the birthplaces are different, subtract .07 from the similarity. – Joshua Taylor, Oct 08 '13 at 15:09
For what it's worth, it sounds like whatever it is you're trying to do might be possible in SPARQL. E.g., if you have that .60 value, then you could bind that in a SPARQL query, retrieve the birthplaces of the resources you care about, and then compute a new value based on whether the birthplaces match. – Joshua Taylor, Oct 08 '13 at 15:10
I computed the taxonomy similarity in this way: FirstSet = set of all classes and subclasses of the 1st instance; SecondSet = set of all classes and subclasses of the 2nd instance; Similarity = |FirstSet intersection SecondSet| / |FirstSet union SecondSet| – user2837896, Oct 08 '13 at 15:32
@user2837896 Thank you! That's actually not too hard to compute in SPARQL. As my answer (which doesn't include that metric yet) shows, it's easy to modify an existing similarity measure using additional data. – Joshua Taylor, Oct 08 '13 at 15:37
But your modification of the similarity is based on a TOTAL match of the birthplaces. I want Liverpool and London to be considered "similar" too, not different. Do you understand? – user2837896, Oct 08 '13 at 15:44
You'll need to provide the method for _computing_ whether two cities are similar or not. Once you do that, we can figure out how to encode it in SPARQL. My point is just that once you've told us the formula, it won't be too hard to compute it in SPARQL. – Joshua Taylor, Oct 08 '13 at 15:49
I've updated my answer to show how you can compute your initial similarity measure. Putting this together with the first query I've shown, you should be able to modify the similarity value, so long as you can encode the formula in SPARQL. – Joshua Taylor, Oct 08 '13 at 15:56

Joshua Taylor · Answer 1 · 2013-10-08T19:43:53.030

1

It's hard to tell exactly what you're trying to compute (because we haven't been told), but it sounds like you'll be able to do this in SPARQL. The following query first computes a similarity metric for pairs of philosophers and binds it to ?initialSimilarity. It's just the ratio of the lengths of their names (the shorter length divided by the longer). It's not a particularly good similarity measure, but you said that you've already got some of these defined (the .60 that you mentioned in the comments). Then the query retrieves the birthplaces of the two philosophers. If they're the same, then .05 is added to the similarity metric; if they're different, .05 is subtracted. The result is bound to ?finalSimilarity. (Note that individuals may have multiple values for the birthPlace property, so you'll see the same pair of philosophers appear n×m times, where n is the number of birthplaces one has, and m the number that the other has. You could group by pairs here and then take the average of the final similarities, or you could do something to resolve the multiple statements, e.g., sample a representative birthplace for each one.)

select ?name1 ?name2 ?bp1 ?bp2 ?initialSimilarity ?finalSimilarity where { 
  dbpedia-owl:Philosopher ^a ?phil1, ?phil2 .
  ?phil1 rdfs:label ?name1 .
  ?phil2 rdfs:label ?name2 .
  filter( langMatches(lang(?name1),"en") && langMatches(lang(?name2),"en"))

  bind ( strlen(?name1) as ?len1 )
  bind ( strlen(?name2) as ?len2 )
  bind ( if(?len1 < ?len2, ?len1, ?len2) as ?minLen )
  bind ( if(?len1 < ?len2, ?len2, ?len1) as ?maxLen )
  bind ( ?minLen/xsd:double(?maxLen) as ?initialSimilarity )

  ?phil1 dbpedia-owl:birthPlace ?bp1 .
  ?phil2 dbpedia-owl:birthPlace ?bp2 .
  bind ( if( ?bp1 = ?bp2, ?initialSimilarity + .05, ?initialSimilarity - .05) as ?finalSimilarity )
}
limit 10
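
For readers who want to check the arithmetic without a SPARQL endpoint, here is a minimal Python sketch of the same logic. The function names, the `step` parameter, and the sample names and birthplaces are assumptions made for illustration; they are not part of the query or of DBpedia. The averaging over n×m birthplace pairs is just one of the two ways of resolving multiple birthplaces suggested above.

```python
from statistics import mean


def name_length_similarity(name1, name2):
    """Ratio of the shorter name length to the longer one (mirrors ?initialSimilarity)."""
    len1, len2 = len(name1), len(name2)
    return min(len1, len2) / max(len1, len2)


def adjusted_similarity(initial, bp1, bp2, step=0.05):
    """Add `step` if the birthplaces match, subtract it otherwise (mirrors ?finalSimilarity)."""
    return initial + step if bp1 == bp2 else initial - step


def pair_similarity(name1, name2, birthplaces1, birthplaces2):
    """Average the adjusted similarity over all n x m birthplace combinations.

    Returns None when either resource has no birthplace, since such a pair
    would produce no rows in the query.
    """
    if not birthplaces1 or not birthplaces2:
        return None
    initial = name_length_similarity(name1, name2)
    return mean(
        adjusted_similarity(initial, bp1, bp2)
        for bp1 in birthplaces1
        for bp2 in birthplaces2
    )


# Hypothetical data: one resource has two recorded birthplaces, the other has one.
print(pair_similarity("Kant", "Immanuel", ["X", "Y"], ["X"]))
```

Averaging treats every recorded birthplace equally; sampling a single representative birthplace per resource would be the other option mentioned above.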


Based on the clarifications in the comments, it's not too hard to compute your initial similarity metric, which you've defined as the number of classes the two individuals have in common divided by the number of classes that they have in total. This can be done with a query like this:

select ?philosopher1
       ?philosopher2
       (count(distinct ?commonType) as ?intersection)
       (count(distinct ?eitherType) as ?union)
       (count(distinct ?commonType)/xsd:double(count(distinct ?eitherType)) as ?similarity)
where {
  dbpedia-owl:Philosopher ^a ?philosopher1, ?philosopher2 .
  filter( ?philosopher1 != ?philosopher2 )
  ?commonType ^a ?philosopher1, ?philosopher2 .
  { ?eitherType ^a ?philosopher1 } UNION
  { ?eitherType ^a ?philosopher2 } 
}
group by ?philosopher1 ?philosopher2 
limit 3

This produces results like the following:

philosopher1                                  philosopher2                                    intersection  union similarity
http://dbpedia.org/resource/Bawa_Muhaiyaddeen http://dbpedia.org/resource/Abdolkarim_Soroush  6             34    0.176471
http://dbpedia.org/resource/Eric_Voegelin     http://dbpedia.org/resource/Abdolkarim_Soroush  6             30    0.2
http://dbpedia.org/resource/Eric_Ormsby       http://dbpedia.org/resource/%C3%89mile_Meyerson 18            24    0.75
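
The metric in these results is the intersection-over-union ratio described in the comments (commonly called the Jaccard index). As a rough cross-check, the same computation can be sketched in Python; the function name and the sample type names below are hypothetical, and the only values taken from the table are the intersection and union counts.

```python
def jaccard(types1, types2):
    """Number of shared types divided by the number of types either resource has."""
    set1, set2 = set(types1), set(types2)
    union = set1 | set2
    if not union:
        # No type information at all; treat the pair as dissimilar (an assumption).
        return 0.0
    return len(set1 & set2) / len(union)


# Hypothetical type sets: two shared types out of four distinct types.
kant_types = {"Person", "Philosopher", "Agent"}
locke_types = {"Person", "Philosopher", "Writer"}
print(jaccard(kant_types, locke_types))

# The similarity column of the table is intersection / union for each row.
for intersection, union in [(6, 34), (6, 30), (18, 24)]:
    print(intersection, union, round(intersection / union, 6))
```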

All you need to do is use a query like the first one to additionally select the birthplaces of the philosophers, and then execute whatever formula you're using to compute similarity to get the similarity modifier, and then you can modify the similarity value.
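
Since no formula for the modifier was given in the question, the following Python sketch only illustrates what a graded (rather than all-or-nothing) adjustment could look like. Everything in it is an assumption: the function names, the `max_shift` parameter, and the linear mapping from birthplace overlap to a shift. The base value 0.34 is the Kant-Locke similarity from the comments, and the birthplace strings are placeholders.

```python
def jaccard(items1, items2):
    """Shared items divided by all distinct items; 0.0 when both collections are empty."""
    set1, set2 = set(items1), set(items2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def modified_similarity(base, birthplaces1, birthplaces2, max_shift=0.05):
    """Shift `base` up or down according to how many birthplaces the two resources share.

    Hypothetical mapping: full overlap adds max_shift, no overlap subtracts it,
    and partial overlap falls linearly in between. The result is clamped to [0, 1].
    """
    if not birthplaces1 or not birthplaces2:
        # Missing data: leave the base similarity unchanged.
        return base
    overlap = jaccard(birthplaces1, birthplaces2)
    shifted = base + max_shift * (2 * overlap - 1)
    return min(1.0, max(0.0, shifted))


# Base similarity 0.34 as reported for Kant and Locke in the comments;
# the birthplace values are placeholders.
print(modified_similarity(0.34, {"Place_A"}, {"Place_B"}))  # no shared birthplace
print(modified_similarity(0.34, {"Place_A"}, {"Place_A"}))  # identical birthplaces
```

To treat two different cities as partly similar, `jaccard` could be applied to the cities' own type or category sets instead of to the birthplace sets, with the caveat that category coverage is uneven across resources.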

edited Oct 08 '13 at 19:43

answered Oct 08 '13 at 15:35

Joshua Taylor


Thanks Joshua. But my problem is how to calculate the modified similarity. Do you have an idea or a formula? – user2837896 Oct 08 '13 at 18:21
A problem could arise when the object of the property is a literal or a number. I have to be able to compare all instances on EVERY property – user2837896 Oct 08 '13 at 18:33
@user2837896 What do you mean by every property? You said in the comments that "[you] would calculate a similarity measure between two instances (in this case Kant and Locke) depending on ONLY a specific property, in this case birthPlace." – Joshua Taylor Oct 08 '13 at 19:39
@user2837896 I _did_ show a way to compute the modified similarity. In the first case I computed a "base similarity" and then added or removed from it based on whether they had some property in common. You could modify the second case to compare how many birthplaces they have in common and then, e.g., increase or decrease their similarity based on that. – Joshua Taylor Oct 08 '13 at 19:45
One moment. I computed these distances: Kant-Locke similarity=0.34 – user2837896 Oct 08 '13 at 19:59
That's right, if you use [this query](http://pastebin.com/R1uqUbP8) which uses `values` to make ?phil1 and ?phil2 be Kant and Locke, you'll see the similarity is 0.34 . – Joshua Taylor Oct 08 '13 at 20:04
Joshua , please, read my answer – user2837896 Oct 08 '13 at 20:59
@user2837896 The only actual formula you've mentioned for computing _any_ kind of similarity is the (categories in common/total categories). You asked how that could be adjusted in some way based on the values of a single property for the resources being compared, but have not specified _any_ formula for _how_ you want to adjust the measure. The code in this answer shows a way to do this. You'll need to fill in the formula that you want to use to adjust the similarity measure, because we can't, because you haven't provided it. – Joshua Taylor Oct 08 '13 at 21:17
I applied the taxonomy similarity to the city resources ( common/total categories) – user2837896 Oct 08 '13 at 21:20
@user2837896 Ah, sorry if I missed that you were using the same metric for comparing cities. DBpedia doesn't necessarily have complete information, and the kinds of information it has for some resources will be different than the kind it has for others. Comparing categories for cities won't necessarily compute a meaningful semantic similarity. If some city has lots of categories that no other cities do, then it will be less similar to all other cities. It's just an aspect of how Wikipedia categories are. – Joshua Taylor Oct 08 '13 at 21:23
So, is the use of common/total categories wrong? – user2837896 Oct 08 '13 at 21:30
@user2837896 There's no single right or wrong metric for semantic similarity (already a vague and poorly defined idea). You have some data available to you from DBpedia. You can compute things based on these data. Some computations will produce measures that correlate well with some notion of semantic similarity, and some won't. The formula you've proposed isn't terrible; since there's some uniformity in topic coverage on Wiki/DBpedia, it's probably useful in lots of situations. The data aren't perfect though; there are places where it won't work as well. It sounds like you found an example. – Joshua Taylor Oct 08 '13 at 21:46
Maybe I found the solution..Can I use SPARQL-DL with a query to service http://dbpedia.org/sparql ? – user2837896 Oct 09 '13 at 10:11
@user2837896 I don't believe that the DBpedia endpoint supports SPARQL-DL (I don't know how many endpoints do, actually). You can download the DBpedia dataset and host it locally, though. Perhaps you can host it locally using a SPARQL-DL endpoint. Best luck! – Joshua Taylor Oct 09 '13 at 12:23
