Why does this federated SPARQL query work in TopBraid but not in Apache Fuseki?

Question

I have the following federated SPARQL query that works as I expect in TopBraid Composer Free Edition (version 5.1.4) but does not work in Apache Fuseki (version 2.3.1):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?s WHERE {
    SERVICE <http://data.linkedmdb.org/sparql> {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
    SERVICE <http://dbpedia.org/sparql?timeout=30000> {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
}

I monitor the sub SPARQL queries that are being executed under the hood and notice that TopBraid correctly executes the following query to the http://dbpedia.org/sparql endpoint:

SELECT  *
WHERE
  { ?s ?p ?o
    FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
  }

while Apache Fuseki executes the following sub query:

SELECT  *
WHERE
  { ?s  ?p  ?o
    FILTER regex(str(?s), replace(?actorName, " ", "_"))
  }

Notice the difference: TopBraid replaces the variable ?actorName with a particular value, 'Paul Reubens', while Apache Fuseki does not. This results in an error from the http://dbpedia.org/sparql endpoint because ?actorName is used in the result set but not assigned.

Is this a bug in Apache Fuseki or a feature in TopBraid? How can I make Apache Fuseki correctly execute this federated query?

update 1: to clarify the difference in behaviour between TopBraid and Apache Fuseki a bit more: TopBraid executes the linkedmdb.org subquery first and then executes the dbpedia.org subquery once for each result of the linkedmdb.org query (substituting ?actorName with the values from the linkedmdb.org results). I assumed Apache Fuseki behaves similarly, but the first subquery to dbpedia.org fails (because ?actorName is used in the result set but not assigned), so it does not continue. But now I am not sure whether it actually wants to execute the subquery to dbpedia.org multiple times, because it never gets that far.

update 2: I think both TopBraid and Apache Fuseki use Jena/ARQ, but I noticed that in stack traces from TopBraid the package name is something like com.topbraid.jena.* which might indicate they use a modified version of Jena/ARQ?

update 3: Joshua Taylor says below: "Surely you wouldn't expect the second service block to be executed for each one of them?". Both TopBraid and Apache Fuseki use exactly this method for the following query:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?film ?label ?subject WHERE {
    SERVICE <http://data.linkedmdb.org/sparql> {
        ?film a movie:film .
        ?film rdfs:label ?label .
        ?film owl:sameAs ?dbpediaLink 
        FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
    }
    SERVICE <http://dbpedia.org/sparql> {
        ?dbpediaLink dcterms:subject ?subject
    }
}
LIMIT 50

but I agree that in principle they should execute both parts once and join them, but maybe for performance reasons they chose a different strategy?

Additionally, notice how the above query works on Apache Fuseki, while the first query of this post does not. So, Apache Fuseki is actually behaving similarly to TopBraid in this particular case. It seems to be related to using a URI variable (?dbpediaLink) in two triple patterns (which works in Fuseki) compared to using a String variable (?actorName) from a triple pattern in a FILTER regex function (which does not work in Fuseki).

Certainly a problem in Fuseki, but I wonder if the query would work in Fuseki without the `timeout` parameter in the endpoint URL? — scotthenninger, Jul 12 '16 at 14:03
@scotthenninger I'm not sure it's a bug in Fuseki; subqueries are executed innermost first, and it might be the same for **service** queries. If that's the case, then a value bound by the first service query **shouldn't** be available in the second service query. — Joshua Taylor, Jul 12 '16 at 14:21
Given that SPARQL is a declarative language, the values of the sub-graph queries should be available. The implementation details of federated queries are an issue. Since both use Jena, I wonder if it isn't different versions of Jena at issue here. — scotthenninger, Jul 12 '16 at 16:36
Regardless, an `?s ?p ?o` query with a timeout that arbitrarily cuts off execution is not a good choice. A `LIMIT` would be better, and reducing the query to specific DBPedia class members would be better yet — scotthenninger, Jul 12 '16 at 16:40
@scotthenninger Thinking about this more, your point about it being a declarative language is a good one. SPARQL is declarative, and the query should have the same results if the order of the **service** calls were reversed. But if the **service** blocks were reversed, then there would be no results: the DBpedia query should return no values, because ?actorName would have no value, and the filter couldn't succeed. So what TopBraid is doing is **wrong** because it produces different results than the query with the services reversed, which should be logically the same. I updated my answer. — Joshua Taylor, Jul 12 '16 at 17:25
If the query is executed with the Query Editor in TopBraid, then the query is simply passed to Jena ARQ. It seems unlikely that Fuseki works differently, but it could. So other than differences in Jena versions I can't see how the same SPARQL engine would produce different results. — scotthenninger, Jul 12 '16 at 17:51

Joshua Taylor · Answer 1 · 2016-07-12T17:23:30.250

Updated (Simpler) Response

In the original answer I wrote (below), I said that the issue was that SPARQL queries are executed innermost first. I think that that still applies here, but I think the problem can be isolated even more easily. If you have

service <ex1> { ... }
service <ex2> { ... }

then the results have to be what you'd get from executing each query separately on the endpoints and then joining the results. The join will merge any results where the common variables have the same values. E.g.,

service <ex1> { values ?a { 1 2 3 } }
service <ex2> { values ?a { 2 3 4 } }

would execute, and you'd have two possible values for ?a in the outer query (2 and 3). In your query, the second service can't produce any results. If you take:

?s ?p ?o .
FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .

and execute it at DBpedia, you shouldn't get any results, because ?actorName isn't bound, so the filter will never succeed. It appears that TopBraid is performing the first service first and then injecting the resulting values into your second service. That's convenient, but I don't think it's correct, because it returns different results than what you'd get if the DBpedia query had been executed first and the other query executed second.
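
To see the unbound-filter point in isolation, here is a minimal sketch that needs neither endpoint. The single VALUES row is a stand-in for DBpedia's data (an assumption for illustration, not something taken from the question). Under the SPARQL 1.1 rules, an unbound variable inside a filter expression raises an error and the solution is dropped, so a conformant engine should return no rows; some endpoints, as the question reports for DBpedia, reject the query outright instead.

```sparql
# Stand-in for what the DBpedia endpoint receives on its own:
# ?actorName is never bound inside this query.
SELECT ?s WHERE {
  # Hypothetical single candidate subject, in place of "?s ?p ?o".
  VALUES ?s { <http://dbpedia.org/resource/Paul_Reubens> }
  # replace() gets an unbound argument, so the filter errors
  # and the only candidate solution is eliminated.
  FILTER(regex(str(?s), replace(?actorName, " ", "_")))
}
```

Replacing ?actorName with the literal "Paul Reubens" makes the same query keep its one row, which is effectively the query TopBraid sends.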

Original Answer

Subqueries in SPARQL are executed inner-most first. That means that a query like

select * {
  { select ?x { ?x a :Cat } }
  ?x foaf:name ?name
}

would first find all the cats, and would then find their names. "Candidate" values for ?x are determined first by the subquery, and then those values for ?x are made available to the outer query. Now, when there are two subqueries, e.g.,

select * {
  { select ?x { ?x a :Cat } }
  { select ?x ?name { ?x foaf:name ?name } }
}

the first subquery is going to find all the cats. The second subquery finds all the names of everything that has a name, and then in the outer query, the results are joined to get just the names of the cats. The values of ?x from the first subquery aren't available during the execution of the second subquery. (At least in principle, a query optimizer might be able to figure out that some things should be restricted.)
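
A self-contained sketch of the same two-subquery shape, with inline VALUES data in place of a dataset. The `:tom`, `:felix` and `:rex` resources and the `http://example.org/` prefix are made up for illustration. Each subquery is evaluated on its own, and only the join on ?x relates them.

```sparql
PREFIX : <http://example.org/>

SELECT * WHERE {
  # "All the cats": two hypothetical resources.
  { SELECT ?x WHERE { VALUES ?x { :tom :felix } } }
  # "Everything that has a name": evaluated without any knowledge
  # of the cats above, so it also returns :rex.
  { SELECT ?x ?name WHERE {
      VALUES (?x ?name) { (:tom "Tom") (:rex "Rex") }
  } }
}
```

The join keeps only the row where ?x is `:tom`: `:felix` has no name row to join with, and `:rex` is not among the cats.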

My understanding is that service blocks have the same kind of semantics. In your query, you have:

SERVICE <http://data.linkedmdb.org/sparql> {
    <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
    ?actor movie:actor_name ?actorName .
}
SERVICE <http://dbpedia.org/sparql?timeout=30000> {
    ?s ?p ?o .
    FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}

You say that tracing shows that TopBraid is executing

SELECT  *
WHERE
  { ?s ?p ?o
    FILTER regex(str(?s), replace("Paul Reubens", " ", "_"))
  }

If TopBraid already executed the first service block and got a unique solution, then that might be an acceptable optimization, but what if, for instance, the first query had returned multiple bindings for ?actorName? Surely you wouldn't expect the second service block to be executed for each one of them? Instead, the second service block is executed as written, and will return a result set that will be joined with the result set from the first.

It probably "doesn't work" in Jena because the second query doesn't actually bind any variables, so it has to look at every triple in the data, which is obviously going to take a long time.

I think that you can get around this by nesting the service calls. If nested service calls are all launched by the "local" endpoint (i.e., nesting a service call doesn't ask a remote endpoint to make another remote query), then you might be able to do:

SERVICE <http://dbpedia.org/sparql?timeout=30000> {
    SERVICE <http://data.linkedmdb.org/sparql> {
      <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
      ?actor movie:actor_name ?actorName .
    }
    ?s ?p ?o .
    FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
}

That might get you the kind of optimization that you want, but that still seems like it might not work unless DBpedia has some efficient ways of figuring out which triples to retrieve based on computing the replace. You're asking DBpedia to look at all its triples, and then to keep the ones where the string form of the subject matches a particular regular expression. It'd probably be better to construct that IRI manually in a subquery and then search for it. I.e.,

SERVICE <http://dbpedia.org/sparql?timeout=30000> {
  { select ?actor {
      SERVICE <http://data.linkedmdb.org/sparql> {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor . 
        ?actor movie:actor_name ?actorName .
      }
      bind(iri(concat("http://dbpedia.org/resource/",
                      replace(?actorName," ","_")))
           as ?actor)
    } } 
  ?actor ?p ?o 
}
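
One possible variant of this idea, sketched under assumptions rather than tested against either endpoint: keep both SERVICE calls at the top level, build the IRI with BIND between them, and use a fresh variable (here the hypothetical ?resource) so that nothing is bound twice. It assumes DBpedia resource IRIs have the form `http://dbpedia.org/resource/` followed by the actor name with underscores, which will not hold for every actor.

```sparql
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>

SELECT ?resource ?p ?o WHERE {
  SERVICE <http://data.linkedmdb.org/sparql> {
    <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
    ?actor movie:actor_name ?actorName .
  }
  # Built locally; ?resource is a new variable, so BIND is legal here.
  # The IRI pattern is an assumption about DBpedia naming.
  BIND(IRI(CONCAT("http://dbpedia.org/resource/",
                  REPLACE(?actorName, " ", "_"))) AS ?resource)
  SERVICE <http://dbpedia.org/sparql> {
    # ?resource occurs in a triple pattern, so the join on it is
    # well defined whichever block is evaluated first.
    ?resource ?p ?o .
  }
}
```

Whether the engine passes the bound ?resource to DBpedia, rather than fetching every triple and joining afterwards, is an implementation choice. Update 3 in the question suggests Fuseki does so for variables that appear in triple patterns.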

I don't believe the inside-out execution is required of SPARQL engines. I know Jena does this, but that's an implementation choice. — scotthenninger, Jul 12 '16 at 16:43
DBPedia won't take the nested service calls. Jena simply packages a `SERVICE` call with prefixes, etc. and sends the query to the service. So in the end you're asking the service to call a service, and DBPedia won't do that. — scotthenninger, Jul 12 '16 at 16:47
The sub-select isn't a valid SPARQL query because ?actor is bound in a triple pattern. Therefore it can't be used in a BIND statement. It is a better approach, though, as it does not rely on an SPO query. — scotthenninger, Jul 12 '16 at 16:56
@scotthenninger The actual execution order doesn't have to be "inside-out", but it does have to be *conceptually*. Section [12 Subqueries](https://www.w3.org/TR/sparql11-query/#subqueries) in the spec says "Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and the results are projected up to the outer query." But, as I said, that's how they behave *logically*, an implementation could do things in a different order so long as the same results are achieved. — Joshua Taylor, Jul 12 '16 at 17:12
@scotthenninger But in any case, that same section says "Note that only variables projected out of the subquery will be visible, or in scope, to the outer query." If the "parallel" service calls are like subqueries, I don't think that the values that one projects should be visible with the scope of the other. In OP's query, OP is asking the engine to *necessarily* perform the first service call *first*, and then to pass those bindings into the second service call. The second service call won't bind any variables on its own (since the variable in the filter can't have any value at that point). — Joshua Taylor, Jul 12 '16 at 17:15
"Inside out" (functional) is required for joins. Anything else is an optimization (in effect index-joins). Just like `log(1+2)` means pass 3 to the log function. — AndyS, Jul 12 '16 at 20:21

score 2 · Answer 2 · answered Jul 12 '16 at 20:26

(long comment)

Consider:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?s WHERE {
    {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
    {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
}

That is the same query but with no SERVICE calls. ?actorName is not in a pattern of the inner second {}.

As join is a commutative operation, the following query, with the two groups swapped, has the same answers as the first query:

SELECT ?s WHERE {
    {
        ?s ?p ?o .
        FILTER(regex(str(?s), replace(?actorName, " ", "_"))) .
    }
    {
        <http://data.linkedmdb.org/resource/film/1> movie:actor ?actor .
        ?actor movie:actor_name ?actorName .
    }
}

The SERVICE version highlights this because the parts are executed separately on different machines.

The join of the two parts happens on the results of each part.
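
A data-free sketch of the same point, with VALUES rows standing in for the two datasets (both rows are made up for illustration). The filter is evaluated inside the second group, where ?actorName is not in scope, so under the SPARQL 1.1 semantics the second group should produce no solutions and the join should be empty, even though the first group binds ?actorName.

```sparql
SELECT ?s ?actorName WHERE {
  # Stand-in for the LinkedMDB part.
  { VALUES ?actorName { "Paul Reubens" } }
  # Stand-in for the DBpedia part: the filter only sees this
  # group's own bindings, and ?actorName is not one of them.
  {
    VALUES ?s { <http://dbpedia.org/resource/Paul_Reubens> }
    FILTER(regex(str(?s), replace(?actorName, " ", "_")))
  }
}
```

Swapping the two groups should not change the result, which is the commutativity argument made above.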
