SPARQL Speed up federated query

Question

I have my own dataset and I want to perform a federated query in SPARQL. Here is the query:

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where { 
    ?bioentity :hasMutatedVersionOf ?gene .
    ?gene :partOf wd:Q430258 .

    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .

        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>21000000 && xsd:integer(?start)<30000000)  
    }

}

I run the query via the GraphDB SPARQL interface, but it is very slow: it takes more than a minute to return 8 records. If I split the query into two parts, each part is very fast.

Query#1

select * where { 
    ?bioentity :hasMutatedVersionOf ?gene .
    ?gene :partOf wd:Q430258 .          

}

56 records in 0.1s

Query#2

select * where { 
     SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .

        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>21000000 && xsd:integer(?start)<30000000)  
    }       

}

158 records in 0.5s

Why is the federation so slow? Is there a way to optimize the performance?

In this particular case, you could place `SERVICE {...}` first, i. e. before `?bioentity :hasMutatedVersionOf ?gene`. On my data, this variant is 10 times faster. It seems that GraphDB performs 56 separate queries to Wikidata in your variant — Stanislav Kralin, Jul 27 '17 at 20:36
@StanislavKralin Whoa! I believed that the query plan did not depend on the way the query is written (as is the case in SQL). Are federated queries a good solution for this kind of problem? It seems faster to perform 2 separate queries and then join the data locally. — floatingpurr, Jul 27 '17 at 21:26
@superciccio14 well you're right, it shouldn't be, but keep in mind that SQL engine developers have had considerably longer to iron out the wrinkles. You may have hit a case where GraphDB's query planner has a glitch. — Jeen Broekstra, Jul 27 '17 at 22:57
@superciccio14, [order matters](https://wiki.blazegraph.com/wiki/index.php/SPARQL_Order_Matters). In GraphDB, you can view the plan explanation [in this way](http://graphdb.ontotext.com/documentation/standard/explain-plan.html). It is hard to form an effective plan for federated queries, since the selectivity of "remote" patterns is unknown. — Stanislav Kralin, Jul 28 '17 at 04:59
As to your general question: federated queries are appropriate when both the "local" and the "remote" result sets are "small", and inappropriate when both are "large". When the local result set is large and the remote result set is small, federated queries are appropriate if the "merging" is performed locally. When the local result set is small but the remote result set is large, federated queries are appropriate if the merging is performed remotely _and_ the queries to the remote endpoint are optimized by your local engine (e. g. a single query with `values` is performed instead of many separate queries). — Stanislav Kralin, Jul 28 '17 at 08:10
@StanislavKralin I confirm that I solved it simply by inverting the order of the two parts (if you want to post an answer, I'll accept it). I did not know that order matters in SPARQL. Thanks! — floatingpurr, Jul 28 '17 at 09:37
@superciccio14, I had tested my query before commenting :-). BTW, how many times faster is the "inverted" query on your data? Does your initial query perform faster on other triplestores/endpoints than on GraphDB (have you tested)? — Stanislav Kralin, Jul 28 '17 at 09:47
@StanislavKralin I was sure you had already tested it. It was just a confirmation : ) The new query is much faster, very close to the performance of the single endpoints: I get the result set in 2 seconds. I did not test any triplestore other than GraphDB. Do you think that Blazegraph, or other stores, could perform even better? — floatingpurr, Jul 28 '17 at 10:01
Different stores will perform differently, and some won't care about the order of your clauses -- as some will test the result set sizes to optimize execution ("cost-based optimization") when building their execution plan. — TallTed, Jul 28 '17 at 14:12

Stanislav Kralin · Accepted Answer · 2020-12-09T16:07:51.877

Short answer

Just place your SERVICE part first, i. e. before ?bioentity :hasMutatedVersionOf ?gene .
Read a good article on the topic (e. g. chapter 5 of this book)

Relevant quote from the aforementioned article:

3.3.2 Query Optimization and Execution

The execution order of query operators significantly influences the overall query evaluation cost. Besides the important query execution time there are also other aspects in the federated scenario which are relevant for the query optimization:

Minimizing communication cost. The number of contacted data sources directly influences the performance of the query execution due to the communication overhead. However, reducing the number of involved data source trades off against completeness of results.

Optimizing execution localization. The standard query interfaces of linked data sources are generally only capable of answering queries on their provided data. Therefore, joins with other data results usually need to be done at the query issuer. If possible at all, a better strategy will move parts of the result merging operations to the data sources, especially if they can be executed in parallel.

Streaming results. Retrieving a complete result when evaluating a query on a large dataset may take a while even with a well optimized execution strategy. Thus one can return results as soon as they become available, which can be optimized by trying to return relevant results first.

Long answer

Example data

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
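PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>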

INSERT { ?gene rdf:type owl:Thing } 
WHERE {
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>26000000 && xsd:integer(?start)<30000000)  
    }
}

The total number of triples is 79. Please note that 26000000 is used instead of 21000000.

Query 1

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
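PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>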

SELECT * WHERE {
    ?gene rdf:type owl:Thing .
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  
    }
}

Query 2

PREFIX : <http://myURIsNamespace#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
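PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>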

SELECT * WHERE {
    SERVICE <https://query.wikidata.org/sparql> { 
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  
    }
    ?gene rdf:type owl:Thing
}

Performance

Engine      Query 1   Query 2
GraphDB     30 sec    1 sec
Blazegraph  1 sec     1 sec

GraphDB behaviour

Executing Query 1, GraphDB performs 79 distinct GET requests to Wikidata¹. These requests are queries of this kind:

SELECT ?start ?statement ?end ?statement2 WHERE {
        <http://www.wikidata.org/entity/Q18031286> p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        <http://www.wikidata.org/entity/Q18031286> p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)
}

Interestingly, on another machine GraphDB performs GET requests of a different kind:

GET /sparql?queryLn="Sparql"&query=<original_query_service_part>&$gene=<http://www.wikidata.org/entity/Q18031286>

This request uses the Sesame protocol; the bindings passed in the URL are not part of the SPARQL 1.1 Protocol.

Perhaps the exact kind of request depends on the value of the internal reuse.vars.in.subselects parameter, whose default value is presumably different on Windows and on Linux.

Blazegraph behaviour

Executing Query 1, Blazegraph performs a single POST request to Wikidata²:

SELECT  ?gene ?statement ?start ?statement2 ?end
WHERE {
        ?gene p:P644 ?statement; 
              wdt:P31 wd:Q7187 ;
              wdt:P703 wd:Q15978631 ;
              wdt:P1057 wd:Q430258 .
        ?statement ps:P644 ?start .
        ?statement pq:P659 wd:Q20966585 .
        ?gene p:P645 ?statement2. 
        ?statement2 ps:P645 ?end .
        ?statement2 pq:P659 wd:Q20966585 .
        FILTER (xsd:integer(?start)>20000000 && xsd:integer(?start)<30000000)  
    
}
VALUES ( ?gene) {
( wd:Q14908148 ) ( wd:Q15320063 ) ( wd:Q17861651 ) ( wd:Q17917753 ) ( wd:Q17928333 )
( wd:Q18024923 ) ( wd:Q18026347 ) ( wd:Q18030710 ) ( wd:Q18031220 ) ( wd:Q18031457 )
( wd:Q18031551 ) ( wd:Q18031832 ) ( wd:Q18032918 ) ( wd:Q18033094 ) ( wd:Q18033798 )
( wd:Q18034311 ) ( wd:Q18035006 ) ( wd:Q18035085 ) ( wd:Q18035609 ) ( wd:Q18036516 )
( wd:Q18036676 ) ( wd:Q18037580 ) ( wd:Q18038385 ) ( wd:Q18038459 ) ( wd:Q18038737 )
( wd:Q18038763 ) ( wd:Q18039997 ) ( wd:Q18040291 ) ( wd:Q18041261 ) ( wd:Q18041415 )
( wd:Q18041558 ) ( wd:Q18045881 ) ( wd:Q18047232 ) ( wd:Q18047373 ) ( wd:Q18047918 )
( wd:Q18047966 ) ( wd:Q18048744 ) ( wd:Q18049145 ) ( wd:Q18049164 ) ( wd:Q18053139 )
( wd:Q18056540 ) ( wd:Q18057411 ) ( wd:Q18060804 ) ( wd:Q18060856 ) ( wd:Q18060876 )
( wd:Q18060905 ) ( wd:Q18060958 ) ( wd:Q20773708 ) ( wd:Q15312971 ) ( wd:Q17860819 )
( wd:Q17917713 ) ( wd:Q18026310 ) ( wd:Q18027015 ) ( wd:Q18031286 ) ( wd:Q18032599 )
( wd:Q18032797 ) ( wd:Q18035169 ) ( wd:Q18035627 ) ( wd:Q18039938 ) ( wd:Q18041207 )
( wd:Q18041512 ) ( wd:Q18041930 ) ( wd:Q18045491 ) ( wd:Q18045762 ) ( wd:Q18046301 )
( wd:Q18046472 ) ( wd:Q18046487 ) ( wd:Q18047149 ) ( wd:Q18047491 ) ( wd:Q18047719 )
( wd:Q18048527 ) ( wd:Q18049774 ) ( wd:Q18051886 ) ( wd:Q18053875 ) ( wd:Q18056212 )
( wd:Q18056538 ) ( wd:Q18065866 ) ( wd:Q20766978 ) ( wd:Q20781543 )
}
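The difference between the two engines comes down to how the local bindings for ?gene are shipped to the remote endpoint: one request per binding, or one request carrying all bindings in a VALUES clause. The following is a minimal Python sketch of that idea, not either engine's actual implementation. The helper names (`chunked`, `build_values_query`), the made-up gene IRIs, and the shortened graph pattern are assumptions for illustration; the block size of 15 is the default mentioned for later GraphDB versions in the comments under this answer. The code only builds query strings and sends nothing over the network.

```python
from typing import Iterable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_values_query(pattern: str, var: str, iris: Iterable[str]) -> str:
    """Return one SELECT query restricted to `iris` by a trailing VALUES clause.

    PREFIX declarations are omitted for brevity; a real request would need them.
    """
    rows = " ".join(f"( <{iri}> )" for iri in iris)
    return f"SELECT * WHERE {{\n{pattern}\n}}\nVALUES ( ?{var} ) {{ {rows} }}"


# Hypothetical local bindings: 79 made-up gene IRIs (not the real ones above).
genes = [f"http://www.wikidata.org/entity/Q{n}" for n in range(1, 80)]
pattern = "  ?gene wdt:P31 wd:Q7187 ."  # shortened stand-in for the SERVICE body

# Strategy A: one remote query per local binding.
one_per_binding = [build_values_query(pattern, "gene", [g]) for g in genes]

# Strategy B: a single remote query carrying all bindings.
single = build_values_query(pattern, "gene", genes)

# Strategy C: bindings sent in blocks of 15.
batched = [build_values_query(pattern, "gene", block) for block in chunked(genes, 15)]

print(len(one_per_binding), 1, len(batched))
```

With 79 local bindings, strategy A issues 79 requests, strategy B issues 1, and strategy C issues 6. The number of round trips, not the size of the result sets, is what separates the timings in the table above.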

Conclusion

With federated queries, it is hard to create an effective execution plan, since the selectivity of remote patterns is unknown.

In your particular case, it should not matter much whether results are joined locally or remotely, because both the local and the remote result sets are small. However, in GraphDB, joining results remotely is less efficient, because GraphDB does not reduce communication costs: it sends one request per local binding.
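When both result sets are small, joining them locally is cheap. The following is a minimal Python sketch of such a local join, assuming each result set has already been fetched as a list of variable-to-value mappings. The function name `hash_join`, the `ex:` identifiers, and the start/end values are made up for illustration. Unlike a full SPARQL join, this sketch joins on a single named variable only.

```python
from collections import defaultdict
from typing import Dict, List

Binding = Dict[str, str]


def hash_join(left: List[Binding], right: List[Binding], key: str) -> List[Binding]:
    """Join two lists of solution mappings on one shared variable."""
    # Index the right-hand side by the join key so each left row is matched in O(1).
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Made-up rows standing in for the local result set ...
local_rows = [
    {"bioentity": "ex:sample1", "gene": "wd:Q18031286"},
    {"bioentity": "ex:sample2", "gene": "wd:Q18031286"},
    {"bioentity": "ex:sample3", "gene": "ex:unknownGene"},
]
# ... and for the remote (SERVICE) result set; start/end are invented values.
remote_rows = [
    {"gene": "wd:Q18031286", "start": "26000001", "end": "26000500"},
    {"gene": "wd:Q14908148", "start": "27000000", "end": "27000900"},
]

result = hash_join(local_rows, remote_rows, "gene")
print(len(result))
```

Only the two local rows whose ?gene also appears in the remote rows survive the join. This is what placing the SERVICE block first achieves: one remote request, followed by a local join.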

¹ For the screenshots, <http://query.wikidata.org/sparql> was used instead of <https://query.wikidata.org/sparql>.

² In Blazegraph, one might write hint:Query hint:optimizer "None" to ensure sequential evaluation.

See also GraphDB [9.5](https://graphdb.ontotext.com/documentation/9.5/standard/release-notes.html) release notes. Perhaps GDB-4493 and GDB-3043 make some improvements. — Stanislav Kralin, Dec 09 '20 at 17:23
Correct, GDB-4493 implements batching with default "blocksize" of 15 (i.e. 15 VALUES are sent in each query). I just posted GDB-6485 to document this feature properly and how to control the blocksize. — Vladimir Alexiev, Feb 12 '21 at 09:48
