I'm new to Fuseki and want to use two TDB datasets for our project: a small one for our own data, and a large one (168 million triples, imported from http://data.bnf.fr).

We need to index the data because SPARQL queries using "FILTER(CONTAINS())" don't work on the large dataset ("BnF_text"). Therefore, I've built a text index for "BnF_text", following this post : Fuseki indexed (Lucene) text search returns no results (but I had to modify the turtle config file to get the text:query working).

It works, but I've encountered a strange problem with "BnF_text": from time to time, the same query times out, and I can't find any error in either the Fuseki logs or the Apache logs.

~~~~~~~ Here are my questions : ~~~~~~~

  • is there a problem with my config files?
  • is the performance affected by the coexistence of 2 datasets?

~~~~~~~ Here are the details of my installation : ~~~~~~~

  • modified the Java memory limit in the fuseki-server script: set to -Xmx4000M
  • SPARQL queries are sent via PHP EasyRDF library
  • I have 2 config files : $FUSEKI_PATH/text_config.ttl + $FUSEKI_PATH/run/configuration/MY_DATASET.ttl
  • I run fuseki-server with this command : ./fuseki-server --config text_config.ttl

Config files

1) text_config.ttl

@prefix :        <#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

## Initialize TDB --------------------------------

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query -------------------------------------
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;

    text:dataset :tdb_dataset_readwrite ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage ------------------------------
:tdb_dataset_readwrite                    # <= EDIT : instead of <#dataset>  
        a             tdb:DatasetTDB ;
        tdb:location  "TDB_PATH" ;
.

# Text index description ------------------------------------------
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:LUCENE_PATH> ;
    text:entityMap <#entMap> ;
    text:storeValues true ;
    .

# Mapping in the index ---------------------------------------------
# URI stored in field "uri" 
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate dcterms:title ]
         [ text:field "text" ; text:predicate foaf:familyName ]
         [ text:field "text" ; text:predicate foaf:name ]
         ) .

# Fuseki services (http) --------------------------------------------- 

# EDIT : added following lines

:service_tdb_all  a                   fuseki:Service ;
        rdfs:label                    "TDB BnF_text" ;
        fuseki:dataset                :text_dataset ; ### 
        fuseki:name                   "BnF_text" ;
        fuseki:serviceQuery           "query" , "sparql" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore " .

2) MY_DATASET.ttl

@prefix :      <http://base/#> .
@prefix tdb:   <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

:service_tdb_all  a                   fuseki:Service ;
        rdfs:label                    "TDB MY_DATASET" ;
        fuseki:dataset                :tdb_dataset_readwrite ;
        fuseki:name                   "MY_DATASET" ;
        fuseki:serviceQuery           "query" , "sparql" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore
                "data" ;
        fuseki:serviceUpdate          "update" ;
        fuseki:serviceUpload          "upload" .

:tdb_dataset_readwrite
        a             tdb:DatasetTDB ;
        tdb:location  "MY_DATASET_TDB_PATH" .

Thanks in advance

Stanislav Kralin
vvffl
  • As shown, query timeouts aren't set at all. Make sure you are running in a clear area (no old run/ setup). – AndyS Jan 18 '18 at 11:01
  • Thanks a lot for your quick reply, I'm gonna try asap. But can you explain why the same query works normally some times? [PS, I'm new on StackOverflow too, tell me if my posts or comments have to be improved] – vvffl Jan 18 '18 at 11:42
  • Added `[] rdf:type fuseki:Server ; ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "50000000,100000000" ] .` to text_config.ttl, and it seems to be much better. I also increased the Java memory to 8 GB (I saw an example here: https://github.com/NatLibFi/Skosmos/wiki/FusekiTuning). So I have 2 more questions: 1) Is my query too complex for a large dataset (too many nodes? see below)? 2) I tried setting the timeout to 1000 ms and saw that the query returns 0 results (no error); is that normal behavior? – vvffl Jan 18 '18 at 13:51
  • I don't know why it should work sometimes unless a timeout is set somewhere. The default is no timeouts. There is no limit on query complexity. Timeouts may cause no results (if the first limit is hit; check the HTTP status code) or truncated results (the results output is intentionally made illegal syntax, because the status code, which is sent first, is "200 OK" and, a feature of HTTP, can't be changed). – AndyS Jan 18 '18 at 13:57
  • Here is my query : `SELECT DISTINCT ?titre ?nom ?prenom ?dateEdition ?manif where { ?alias foaf:familyName "Spinoza" . ?alias foaf:familyName ?nom . OPTIONAL { ?alias foaf:givenName ?prenom } ?alias owl:sameAs ?auteur . ?oeExpr2 dc_t:contributor ?auteur . ?oeExpr1 owl:sameAs ?oeExpr2 . ?manif rdarel:expressionManifested ?oeExpr1 . ?manif text:query ( dcterms:title "éthique" ) . ?manif dc_t:title ?titre . ?manif dc_t:date ?dateEdition . } order by desc(?dateEdition) limit 100` – vvffl Jan 18 '18 at 13:57
  • @AndyS : thx. The http code is 200 (info in shell). Could it be linked to the text index? I've noticed some problems (find 2 words of a title but not the third...) – vvffl Jan 18 '18 at 14:00
  • By the way, you talked about "clear area" : I checked that no PID file was left when restarting fuseki with new parameters, but maybe it's not "clear" enough. – vvffl Jan 18 '18 at 14:05
  • Another question: is `-Xmx8G` a normal setting for a large dataset? – vvffl Jan 18 '18 at 14:07
  • Maybe the problem is not with Fuseki but EasyRDF : here is the error message : `Type: EasyRdf_Exception. Message: Request to localhost:3030 timed out` – vvffl Jan 18 '18 at 14:28
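
AndyS's point about status codes and deliberately broken output can be turned into a small client-side check. The following is a minimal Python sketch, not Fuseki's or EasyRDF's own logic: it assumes the results were requested as SPARQL JSON, and the function name and the four labels are hypothetical. The thread does not show how exactly the output is made invalid, so the sketch simply treats any unparseable body as suspect.

```python
import json


def classify_sparql_response(status_code, body_text):
    """Label a SPARQL JSON SELECT response (labels are illustrative only)."""
    # A non-200 status means the failure happened before any results were sent.
    if status_code != 200:
        return "http_error"
    # A 200 status cannot be changed once sent, so a late timeout can only
    # show up as a body that no longer parses.
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        return "truncated"
    if not isinstance(parsed, dict) or "results" not in parsed:
        return "truncated"
    bindings = parsed["results"].get("bindings", [])
    return "ok" if bindings else "empty"
```

An "empty" label with status 200, as reported above for the 1000 ms timeout, is therefore worth distinguishing from a body that fails to parse.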

1 Answer

Thanks Andy, you were right. The problem came from EasyRDF and not from Fuseki. I found this: https://groups.google.com/d/msg/skosmos-users/WhtZwnsxOFs/MtAocr8vDgAJ , so I changed the timeout in vendor/easyrdf/easyrdf/lib/EasyRdf/Http/Client.php, and everything seems to be OK now. I'm going to run a few more tests and then try to mark the question as solved.

EDIT: 'everything seems to be ok now' = the "timeout" message from EasyRdf_Exception has disappeared
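
To check independently of EasyRDF whether a timeout comes from the client or from Fuseki, a small script with an explicit client-side timeout can help. This is a Python sketch using only the standard library, standing in for the PHP client; the function name `run_query`, the result tuples, and the injectable `opener` parameter are assumptions made for illustration. The endpoint used in the usage note is inferred from the service name and host mentioned in the question.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def run_query(endpoint, sparql, timeout_s, opener=urllib.request.urlopen):
    """POST a SPARQL query and report whether the client gave up waiting."""
    # URL-encoded form POST, one of the query forms of the SPARQL protocol.
    data = urllib.parse.urlencode({"query": sparql}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            body = response.read().decode("utf-8")
            return ("response", response.status, body)
    except socket.timeout:
        # Raised when the server is too slow to answer: a client-side timeout.
        return ("client_timeout", None, "")
    except urllib.error.URLError as err:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(err.reason, socket.timeout):
            return ("client_timeout", None, "")
        raise
```

For example, `run_query("http://localhost:3030/BnF_text/query", "ASK {}", 120)` waits up to 120 seconds. If a short client timeout reproduces the failure while nothing shows up in the Fuseki logs, the limit is on the client side.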

vvffl
  • I still have a problem : I tried to send the same query several times with a script (using fuseki s-query in a while loop), and noticed that the execution time is often different at the first request. For instance, once I had 76 seconds for first request, and average +/- 1s for the other ones. – vvffl Jan 19 '18 at 16:24
  • Other question : when I use **s-query**, it's HTTP **GET** ; but in our application we use **Ajax POST** + EasyRDF ; I haven't made any stats, but it seems that the execution time is longer with Ajax POST. Could this affect the performance? Or maybe I have to change our Apache configuration? – vvffl Jan 19 '18 at 16:31
  • The first request touches the data first; following requests can benefit from caching. Many triple stores behave this way and have some kind of "warm-up phase". – UninformedUser Jan 29 '18 at 12:08
  • Thank you for the explanation. I had made a workaround to improve performance : send a dummy query to "awake" the triple store. This works on my virtual machine, but not on our server (maybe due to disk access speed). – vvffl Jan 30 '18 at 11:15
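
The cold-versus-warm difference described in these comments can be measured by timing the first request separately from the rest. Below is a minimal Python sketch; the helper names `time_repeated` and `summarize` are hypothetical, and `send_query` stands for any callable that sends one request (over GET or POST), so the same loop can compare both methods.

```python
import time


def time_repeated(send_query, repetitions, clock=time.perf_counter):
    """Run send_query() several times and return each duration in seconds."""
    durations = []
    for _ in range(repetitions):
        start = clock()
        send_query()
        durations.append(clock() - start)
    return durations


def summarize(durations):
    """Separate the first (cold) request from the average of the warm ones."""
    first, rest = durations[0], durations[1:]
    warm_average = sum(rest) / len(rest) if rest else None
    return {"first": first, "warm_average": warm_average}
```

Running this once right after a Fuseki restart and once later shows whether a dummy "wake-up" query actually shortens the first real request on a given machine.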