You can compute this kind of metric using SPARQL, though it's a bit ugly. Let's assume some data like this:
prefix dcterms: <http://purl.org/dc/terms/>
prefix : <http://example.org/books/>
:book1 a :Book ; dcterms:subject :subject1 , :subject2, :subject3 .
:book2 a :Book ; dcterms:subject :subject2 , :subject3, :subject4 .
:book3 a :Book ; dcterms:subject :subject4 , :subject5 .
There are three books. Books 1 and 2 have two subjects in common, and each has one that the other does not have. Books 2 and 3 have one subject in common, but Book 2 has two that Book 3 does not have, while Book 3 has only one that Book 2 does not have. Books 1 and 3 have no subjects in common.
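To make the target concrete: write C11 for the number of subjects two books share, C10 for the subjects only the first book has, and C01 for the subjects only the second book has. The ratio that the final query further down computes is then C11 / (C11 + C10 + C01). As a worked sketch for the three pairs just described:

```latex
\begin{align*}
\mathrm{sim} &= \frac{C_{11}}{C_{11} + C_{10} + C_{01}} \\
\mathrm{sim}(\text{book1}, \text{book2}) &= \frac{2}{2 + 1 + 1} = 0.5 \\
\mathrm{sim}(\text{book2}, \text{book3}) &= \frac{1}{1 + 2 + 1} = 0.25 \\
\mathrm{sim}(\text{book1}, \text{book3}) &= \frac{0}{0 + 3 + 2} = 0
\end{align*}
```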
The trick here is to use some nested subqueries, and to grab the different values (C10, C01, and C11) at different levels in the nesting. The innermost query is
select ?book1 ?book2 (count(?left) as ?c10) where {
  :Book ^a ?book1, ?book2 .
  FILTER( !sameTerm(?book1,?book2) )
  OPTIONAL {
    ?book1 dcterms:subject ?left .
    FILTER NOT EXISTS { ?book2 dcterms:subject ?left }
  }
}
group by ?book1 ?book2
which grabs each pair of distinct books and computes the number of subjects that the left book has that the right doesn't. By wrapping this in another query, we can then grab the number of subjects that the right book has that the left doesn't. This makes the query:
select ?book1 ?book2 (count(?right) as ?c01x) (sample(?c10) as ?c10x) where {
  {
    select ?book1 ?book2 (count(?left) as ?c10) where {
      :Book ^a ?book1, ?book2 .
      FILTER( !sameTerm(?book1,?book2) )
      OPTIONAL {
        ?book1 dcterms:subject ?left .
        FILTER NOT EXISTS { ?book2 dcterms:subject ?left }
      }
    }
    group by ?book1 ?book2
  }
  OPTIONAL {
    ?book2 dcterms:subject ?right .
    FILTER NOT EXISTS { ?book1 dcterms:subject ?right }
  }
}
group by ?book1 ?book2
Note that we still have to select ?book1 and ?book2, and sample(?c10) as ?c10x, in order to pass the values outward. (We have to use ?c10x because the name ?c10 has already been used at this scope, and sample because ?c10 is not a grouping variable, so it can only be projected through an aggregate.) Finally, we wrap this in one more query to get the common subjects, and to do the computation, which gives us:
prefix dcterms: <http://purl.org/dc/terms/>
prefix : <http://example.org/books/>

select ?book1 ?book2
       (count(?both) as ?c11)
       (sample(?c10x) as ?c10)
       (sample(?c01x) as ?c01)
       (count(?both) / (count(?both) + sample(?c10x) + sample(?c01x)) as ?sim)
where {
  {
    select ?book1 ?book2 (count(?right) as ?c01x) (sample(?c10) as ?c10x) where {
      {
        select ?book1 ?book2 (count(?left) as ?c10) where {
          :Book ^a ?book1, ?book2 .
          FILTER( !sameTerm(?book1,?book2) )
          OPTIONAL {
            ?book1 dcterms:subject ?left .
            FILTER NOT EXISTS { ?book2 dcterms:subject ?left }
          }
        }
        group by ?book1 ?book2
      }
      OPTIONAL {
        ?book2 dcterms:subject ?right .
        FILTER NOT EXISTS { ?book1 dcterms:subject ?right }
      }
    }
    group by ?book1 ?book2
  }
  OPTIONAL {
    ?both ^dcterms:subject ?book1, ?book2 .
  }
}
group by ?book1 ?book2
order by ?book1 ?book2
This rather monstrous query, applied to our data, computes these results:
$ arq --data data.n3 --query similarity.sparql
--------------------------------------------
| book1  | book2  | c11 | c10 | c01 | sim  |
============================================
| :book1 | :book2 | 2   | 1   | 1   | 0.5  |
| :book1 | :book3 | 0   | 3   | 2   | 0.0  |
| :book2 | :book1 | 2   | 1   | 1   | 0.5  |
| :book2 | :book3 | 1   | 2   | 1   | 0.25 |
| :book3 | :book1 | 0   | 2   | 3   | 0.0  |
| :book3 | :book2 | 1   | 1   | 2   | 0.25 |
--------------------------------------------
If the FILTER( !sameTerm(?book1,?book2) ) line is removed, so that the similarity of each book to itself is also computed, we see the correct value (1.0):
$ arq --data data.n3 --query similarity.sparql
--------------------------------------------
| book1  | book2  | c11 | c10 | c01 | sim  |
============================================
| :book1 | :book1 | 3   | 0   | 0   | 1.0  |
| :book1 | :book2 | 2   | 1   | 1   | 0.5  |
| :book1 | :book3 | 0   | 3   | 2   | 0.0  |
| :book2 | :book1 | 2   | 1   | 1   | 0.5  |
| :book2 | :book2 | 3   | 0   | 0   | 1.0  |
| :book2 | :book3 | 1   | 2   | 1   | 0.25 |
| :book3 | :book1 | 0   | 2   | 3   | 0.0  |
| :book3 | :book2 | 1   | 1   | 2   | 0.25 |
| :book3 | :book3 | 2   | 0   | 0   | 1.0  |
--------------------------------------------
If you don't need to preserve the individual Cmn values, then you might be able to optimize this, e.g., by computing C10 in the innermost query and C01 in the middle query, but then, instead of projecting both up individually, projecting just their sum (C10+C01), so that in the outermost query, where you compute C11, you can just do (C11 / (C11 + (C10+C01))).
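A minimal sketch of that variant, following the nesting of the queries above (C10 in the innermost query, C01 in the middle one; the two are symmetric, so the order does not matter). The variable name ?diffx is made up for this illustration, and this version is untested, so treat it as a starting point rather than a verified query:

```sparql
prefix dcterms: <http://purl.org/dc/terms/>
prefix : <http://example.org/books/>

select ?book1 ?book2
       (count(?both) / (count(?both) + sample(?diffx)) as ?sim)
where {
  {
    # middle query: add C01 to C10 and pass only the sum (C10+C01) outward
    select ?book1 ?book2 (count(?right) + sample(?c10) as ?diffx) where {
      {
        # innermost query: C10, the subjects of ?book1 that ?book2 lacks
        select ?book1 ?book2 (count(?left) as ?c10) where {
          :Book ^a ?book1, ?book2 .
          FILTER( !sameTerm(?book1,?book2) )
          OPTIONAL {
            ?book1 dcterms:subject ?left .
            FILTER NOT EXISTS { ?book2 dcterms:subject ?left }
          }
        }
        group by ?book1 ?book2
      }
      OPTIONAL {
        ?book2 dcterms:subject ?right .
        FILTER NOT EXISTS { ?book1 dcterms:subject ?right }
      }
    }
    group by ?book1 ?book2
  }
  # outermost query: C11, the subjects the two books share
  OPTIONAL {
    ?both ^dcterms:subject ?book1, ?book2 .
  }
}
group by ?book1 ?book2
order by ?book1 ?book2
```

On the example data this should give the same sim values as before, just without the c11, c10, and c01 columns. It saves one projected variable per level, but the three levels of nesting remain.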