I have the following federated SPARQL query that works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?s WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
<http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}
}
I monitor the sub SPARQL queries that are being executed under the hood and notice that TopBraid correctly executes the following query to the http://dbpedia.org/sparql endpoint:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
}
while Apache Fuseki executes the following sub query:
SELECT *
WHERE
{ ?s ?p ?o
FILTER regex(str(?s), replace(?actorName, " ", "_"))
}
Notice the difference; TopBraid replace the variable ?actorName with a particular value 'Paul Reubens', while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint because the ?actorName is used in the result set but not assigned.
Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki correctly execute this Federated query.
update 1: to clarify the behaviour difference between TopBraid and Apache Fuseki a bit more. TopBraid executes the linkedmdb.org subquery first and then executes the dbpedia.org subquery for each result of the linkedmdb.org query )(and substitutes the ?actorName with the results from the linkedmdb.org query). I assumed Apache Fuseki behaves similar, but the first subquery to dbpedia.org fails (because ?actorName is used in the result set but not assigned) and so it does not continue. But now I am not sure if it actually want to execute the subquery to dbpedia.org multiple times, because it never gets there.
update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed that in stack traces from TopBraid the package name is something like com.topbraid.jena.* which might indicate they use a modified version of Jena/ARQ?
update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?". Both TopBraid and Apache Fuseki use exactly this method for the following query:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?film ?label ?subject WHERE {
SERVICE <http://data.linkedmdb.org/sparql> {
?film a movie:film .
?film rdfs:label ?label .
?film owl:sameAs ?dbpediaLink
FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
}
SERVICE <http://dbpedia.org/sparql> {
?dbpediaLink dcterms:subject ?subject
}
}
LIMIT 50
but I agree that in principle they should execute both parts once and join them, but maybe for performance reasons they chose a different strategy?
Additionally, notice how the above query works on Apache Fuseki, while the first query of this post does not. So, Apache Fuseki is actually behaving similarly to TopBraid in this particular case. It seems to be related to using an URI variable (?dbpediaLink) in two triple patterns (which works in Fuseki) compared to using a String variable (?actorName) from a triple pattern in a FILTER regex function (which does not work in Fuseki).