
I have a large TDB dataset (cf. this post: Fuseki config for 2 datasets + text index : how to use turtle files?) and I need to extract data from it in order to build a "subgraph" and import it into Fuseki. I found that OFFSET could be a way to page through all the results of a query when they are too numerous to retrieve at once (about 12M triples).

Here are my questions:

1) I read in the W3C recommendation that OFFSET should be used with ORDER BY:

Using LIMIT and OFFSET (...) will not be useful unless the order is made predictable by using ORDER BY.

(cf. https://www.w3.org/TR/rdf-sparql-query/#modOffset )

-- Unfortunately, ORDER BY seems to be very slow on my dataset. I found some examples of OFFSET without ORDER BY (here's one: Getting list of persons using SPARQL dbpedia), so I tried OFFSET alone, and it seems to work.

-- I need to be sure that if I repeat the same query with increasing OFFSET values, I'll get all the results. I've therefore tried it on a sample and checked that the results are distinct and that I get the expected number of them; everything seems OK. So I assume that ORDER BY is only needed if the dataset is modified between two queries ("predictable order")?

2) Does performance depend on the ratio between LIMIT and OFFSET?

-- I tried LIMIT = 100, 1000, 5000, and 10000 with the same OFFSET, and the speed seems to be nearly the same.

-- I also compared different values of OFFSET, and the execution time seems to be longer for a large offset (but maybe it's only a TDB issue, cf. https://www.mail-archive.com/users@jena.apache.org/msg13806.html).

~~~~~~ more info ~~~~~~

-- I use a script with tdbquery and this command (the surrounding loop is sketched after the info block below):

./tdbquery --loc=$DATASET --time --results=ttl "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET"

-- Dataset: ~168M triples, of which ~12M have a dcterms:title.

~~~~~~~~~~~~~~~~~~~~~~
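
For reference, the wrapper around that command is roughly a loop like the one below (a sketch only, untested: the PREFIX declarations are left as a placeholder, and the stop test just checks whether a batch came back empty).

#!/usr/bin/env bash
# Sketch of the paging loop around tdbquery (untested).
# DATASET points at the TDB directory; PREFIXES holds the same PREFIX
# declarations as in the command above (left as a placeholder here).
DATASET=/path/to/tdb
PREFIXES='...'
LIMIT=10000
OFFSET=0

while : ; do
    OUT="titles_${OFFSET}.ttl"
    ./tdbquery --loc="$DATASET" --results=ttl \
        "$PREFIXES construct { ?exp dcterms:title ?titre } where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } limit $LIMIT offset $OFFSET" \
        > "$OUT"
    # Crude emptiness test: stop when a batch contains no triples, i.e.
    # nothing besides @prefix lines and blank lines (adjust if your Jena
    # version writes PREFIX instead of @prefix).
    if ! grep -q -v -e '^@prefix' -e '^[[:space:]]*$' "$OUT"; then
        break
    fi
    OFFSET=$((OFFSET + LIMIT))
done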

Thanks in advance

vvffl
  • 1) That is correct, there is no guarantee that without ORDER BY you will traverse the data in the same order; everything would be triple store dependent. It might work, but formally you can't guarantee it. – UninformedUser Jan 29 '18 at 12:01
  • 2) A larger OFFSET can be expensive because you have to scan through a larger result of the evaluated query. It's not TDB-dependent; other triple stores also don't perform that well, though there might be triple stores with a different behavior. – UninformedUser Jan 29 '18 at 12:03
  • If you have your own Fuseki, why do you need OFFSET to get all the data? This is only necessary for triple stores that ship with some option to limit the returned result per query. For instance, DBpedia is deployed on Virtuoso, and as the public endpoint is a shared service, the limit is 10000. Clearly, this can be configured in the `virtuoso.ini` if you use your own server. – UninformedUser Jan 29 '18 at 12:06
  • Thank you very much. About 1), you say "everything would be triple store dependent". I tried with a sample of 500,000 triples and it seems to be OK: do you know if TDB can be used this way? – vvffl Jan 29 '18 at 12:35
  • About 2), "why do you need OFFSET to get all the data": I have my own TDB + Fuseki, but I thought that a query with >12M results would "crash" or take hours, and that it would be easier to store the results. – vvffl Jan 29 '18 at 12:39
  • Query execution of SELECT is nearly always streaming. ORDER BY stops that. DISTINCT is streaming but consumes workspace (RAM). But CONSTRUCT isn't streaming - it needs to build the model. So execute a SELECT query "SELECT REDUCED ?exp ?titre" and build the triples locally. REDUCED is a cheap DISTINCT and consumes little space. As you are creating a model, duplicates are irrelevant. – AndyS Jan 29 '18 at 14:23
  • Thank you Andy, I'm going to try that. One more question: what would be the most efficient way to store the results? Should I use `INSERT` into a new graph for all results at once, or serialize and upload afterwards? What about `OFFSET` with this query? – vvffl Jan 29 '18 at 15:32
  • but INSERT would need to build a model, right? – vvffl Jan 29 '18 at 15:38
  • So, end of my first test after more than 90 minutes: I had run fuseki-server with -Xms4096m and -Xmx6144m and executed this command: `./s-update --service="http://localhost:3030/BnF_text/update" "PREFIX dcterms: PREFIX rdarelationships: INSERT { graph { ?exp dcterms:title ?titre . ?manif rdarelationships:expressionManifested ?exp } } where { select * where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp } }"` and I got `500 : GC overhead limit exceeded` – vvffl Jan 29 '18 at 15:54
  • I guess this query has to build the model (cf. Andy's comment), and using SELECT REDUCED could help. I'm going to read up on streams and models. – vvffl Jan 29 '18 at 15:57
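
For reference, here is a command-line sketch of Andy's SELECT REDUCED suggestion (untested; it assumes the bindings come out in SPARQL TSV syntax with no stray tabs, and writes each row as one dcterms:title triple in N-Triples).

# Sketch of the SELECT REDUCED approach from the comments (untested).
# DATASET and PREFIXES as in the question above; the dcterms:title IRI
# is written out in full so the output is plain N-Triples.
./tdbquery --loc="$DATASET" --results=tsv \
    "$PREFIXES select reduced ?exp ?titre where { ?manif dcterms:title ?titre ; rdarelationships:expressionManifested ?exp }" \
    | tail -n +2 \
    | awk -F '\t' '{ print $1, "<http://purl.org/dc/terms/title>", $2, "." }' \
    > titles.nt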

1 Answer


Thank you AKSW & Andy, your comments helped me learn about SPARQL.

So I tried SELECT REDUCED, but it takes very long, and without OFFSET the process can't be stopped and resumed. Besides, I need to transform the results to produce a new graph (and I want to make other transformations, on authors, etc.).

I read some pages about streams, models, and serialization, and found that I could transform the data directly, with several update operations in the same request. Here is a potential solution: first make a copy of the TDB files, then run this update in a while loop (the loop itself is sketched after the query):

# PREFIX declarations for dcterms: and rdarelationships: go here,
# as in the tdbquery command from the question
DELETE {
    ?manif dcterms:title ?titre ;
        rdarelationships:expressionManifested ?exp
}
INSERT {
    graph <http://titres_1> {
        ?manif rdarelationships:expressionManifested ?exp .
        ?exp dcterms:title ?titre
    }
}
WHERE {
    select * where
        {
            ?manif dcterms:title ?titre ;
                rdarelationships:expressionManifested ?exp
        }
    LIMIT 100000
}
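
The surrounding loop is then just a matter of repeating the update until everything has been moved. A minimal sketch (untested), assuming the update above is saved with its PREFIX declarations in a file named move_titles.ru and that Fuseki runs locally as in my comments:

#!/usr/bin/env bash
# Sketch of the driving loop (untested). move_titles.ru is assumed to contain
# the DELETE/INSERT request above, together with its PREFIX declarations.
SERVICE="http://localhost:3030/BnF_text/update"

# ~12M matching triples / 100 000 per batch is roughly 120 iterations;
# adjust the count, or replace it with an ASK query that tests whether
# any matching triple is left in the default graph.
for i in $(seq 1 130); do
    echo "batch $i"
    ./s-update --service="$SERVICE" "$(cat move_titles.ru)"
done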

This solution has several advantages:

  • It is very simple: no Java code (I know neither the Jena classes nor Java, and have no time to learn them now) and no file processing.
  • I can stop the process whenever I need to.
  • Deleting the matched triples at each iteration guarantees that all matching triples are eventually retrieved, without OFFSET.
  • After each deletion the default graph gets smaller, so the queries should become faster and faster.

Maybe something more efficient could be done: any idea would be appreciated.

----- EDIT ----------

I've begun to transform the data, using a bash script to repeat the query and s-get ... | split to export the triples to .nt files. After each export, the "temp" graph is cleared with s-update. One export round is sketched below.
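
Roughly, one export round looks like this (a sketch, untested: the /data endpoint, the chunk size, and the output prefix are assumptions based on a default Fuseki setup, and the serialization returned by s-get depends on the server's defaults).

#!/usr/bin/env bash
# One export round (sketch, untested): dump the temporary graph, split the
# output into chunks, then clear the graph before the next batch of updates.
./s-get http://localhost:3030/BnF_text/data http://titres_1 \
    | split -l 500000 - titres_part_
./s-update --service=http://localhost:3030/BnF_text/update \
    'CLEAR GRAPH <http://titres_1>'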

Everything seems to be OK, but:

  1. it takes more time than I expected (about 1 hour for 50 runs of the query with LIMIT = 10 000).
  2. my TDB files are now much bigger than I expected, as if the deleted triples were not really deleted (are they stored in some "backup" graph, or are only the indexes modified?). Before the transformation: ~168,300,000 triples in the default graph and 20.6 GB of TDB files. Now: ~155,100,000 triples in the default graph and 55 GB of files...

Therefore, two questions:

  • a) Is this "normal" behavior? Can I reduce the size of the files (not only a storage concern: I assume smaller files would also make the next queries faster)? One idea I want to try is sketched after this list.
  • b) Do you know another method, using command-line utilities, that could be faster?
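
The idea mentioned in a) is to rebuild the database from a full dump, on the assumption that a freshly loaded copy carries none of the space left behind by the deletes (a sketch only, untested; tdbdump writes N-Quads, so the named graph should be preserved).

#!/usr/bin/env bash
# Sketch (untested): dump everything and reload it into a fresh TDB directory.
# tdbdump writes N-Quads, so the named graph <http://titres_1> is kept.
DATASET=/path/to/tdb
./tdbdump --loc="$DATASET" > full_dump.nq
mkdir "${DATASET}_rebuilt"
./tdbloader --loc="${DATASET}_rebuilt" full_dump.nq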

Thanks in advance


LAST EDIT

It seems that the file size and the performance depend on parameters that can be set in a tdb.cfg file: see http://jena.apache.org/documentation/tdb/store-parameters.html .

I didn't have any .cfg file in my dataset folder. My first test was to add one and set tdb.file_mode to 'direct': the files no longer seem to grow the way they did before. However, this mode costs more RAM and queries are slower (even if I increase the Java -Xms and -Xmx settings). I think there is a trade-off between file size and query performance. If I have time, I'll subscribe to the jena-users mailing list and ask what the best tuning would be. The tdb.cfg I used is sketched below.
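
For reference, what I tried amounts to dropping a small JSON file named tdb.cfg into the database directory with only tdb.file_mode set; whether the other parameters must be listed explicitly or fall back to their defaults is something to check against the page linked above.

# Create a tdb.cfg in the TDB directory with tdb.file_mode set to 'direct'
# (sketch; DATASET is the path to the TDB directory).
DATASET=/path/to/tdb
cat > "$DATASET/tdb.cfg" <<'EOF'
{
  "tdb.file_mode" : "direct"
}
EOF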

Conclusion: it was interesting to test these queries, but my dataset is too large. I'm going to build another one from the original XML files, either with named graphs (which tdbloader2 does not allow) or as several smaller datasets.

vvffl