I have a large TDB dataset (cf. this post: "Fuseki config for 2 datasets + text index: how to use turtle files?") and I need to extract data from it to build a "subgraph" and import it into Fuseki.
I found that OFFSET could be a way to retrieve all the results of a query when there are too many of them to fetch at once (about 12M triples).
Here are my questions :
1) I read in the W3C recommendation that OFFSET should be used with ORDER BY:
Using LIMIT and OFFSET (...) will not be useful unless the order is made predictable by using ORDER BY.
(cf. https://www.w3.org/TR/rdf-sparql-query/#modOffset )
-- Unfortunately, ORDER BY seems to be very slow on my dataset. I found some examples of OFFSET without ORDER BY (here's one: "Getting list of persons using SPARQL dbpedia"), so I tried using OFFSET alone, and it seems to work.
-- I need to be sure that if I repeat the same query with successive OFFSET values, I'll get all the results. So I tried it on a sample and checked that the results contain only distinct values and add up to the expected number; everything seems OK. So I assume that ORDER BY is only needed if the dataset is modified between two queries ("predictable order")?
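The kind of check described above could be sketched as follows. This is only an illustration under assumptions: the function name check_pages is hypothetical, and it assumes each page was saved with one triple per line (N-Triples style), which is not what --results=ttl produces.

```shell
#!/bin/sh
# Sketch: compare the total and distinct number of triples across page files.
# ASSUMPTION: every input file holds one triple per line (N-Triples style).
# check_pages is a hypothetical helper name, not part of Jena.

# check_pages FILE...: print "<total> <distinct>" for the given page files.
check_pages() {
    # Count all non-empty lines over every page.
    total=$(cat "$@" | grep -c .)
    # Count the non-empty lines that remain after removing duplicates.
    distinct=$(cat "$@" | grep . | sort -u | wc -l)
    # $((...)) strips any padding that wc may add on some systems.
    echo "$((total)) $((distinct))"
}
```

If the two numbers differ, at least one triple appears more than once across the pages; if the distinct count is below the expected number, some results were never returned.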
2) Does performance depend on the LIMIT/OFFSET ratio?
-- I tried LIMIT = 100, 1000, 5000 and 10000 with the same offset, and the speed seems to be nearly the same.
-- I also compared different values for OFFSET, and the execution time seems to be longer for a large offset (but maybe that is only a problem with TDB, cf. https://www.mail-archive.com/users@jena.apache.org/msg13806.html).
~~~~~~ more info ~~~~~~
-- I use a script with tdbquery and this command:
./tdbquery --loc=$DATASET --time --results=ttl "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET"
-- Dataset: ~168M triples, of which ~12M have dcterms:title.
~~~~~~~~~~~~~~~~~~~~~~
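A minimal sketch of how a paging script around the command above could be structured, assuming the total number of result rows is known in advance (about 12M here). The function names page_offsets and fetch_pages and the page-<offset>.ttl file naming are hypothetical; only the tdbquery options already shown above are used.

```shell
#!/bin/sh
# Sketch: walk a result set in LIMIT-sized pages using OFFSET.
# ASSUMPTIONS: the total row count is known beforehand, and DATASET and
# PREFIXES are set as in the tdbquery command shown above.

# page_offsets TOTAL LIMIT: print 0, LIMIT, 2*LIMIT, ... one per line,
# for every page needed to cover TOTAL rows.
page_offsets() {
    total=$1
    limit=$2
    # Guard against a non-positive LIMIT, which would loop forever.
    [ "$limit" -gt 0 ] || return 1
    offset=0
    while [ "$offset" -lt "$total" ]; do
        echo "$offset"
        offset=$((offset + limit))
    done
}

# fetch_pages TOTAL LIMIT OUTDIR: run the CONSTRUCT query once per page and
# write each page to OUTDIR/page-<offset>.ttl; stop at the first failure.
fetch_pages() {
    for off in $(page_offsets "$1" "$2"); do
        ./tdbquery --loc="$DATASET" --results=ttl \
            "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $2 offset $off" \
            > "$3/page-$off.ttl" || return 1
    done
}
```

For example, fetch_pages 12000000 10000 ./out would issue 1200 queries, with offsets 0, 10000, ... up to 11990000. Note that this loop relies on each page being cut from the same overall ordering, which is exactly what question 1 asks about.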
Thanks in advance