Apache Jena full-text search (with external content)

Question

I would like to configure something like this:

An RDF dataset of metadata about books;
Books stored separately as XHTML files, with paragraphs that have unique IDs;
Every book’s metadata includes something like a dc:source link to the file (absolute, like a proper URI? What about scaling?).

I know this could be pretty trivial, but I can’t grasp it properly. To begin with, I am trying to index just tiny plain-text files, each linked via dc:source in the metadata file. As I understand it, this should be enough to index everything included. I am following the approach from another post on this topic. Unlike its author, I want to index the RDF dataset as well as the external files. Notably, these two commands log no errors (on the contrary, the first one reports 57 triples loaded):

java -cp /home/honza/.apache-jena-fuseki-2.3.0/fuseki-server.jar tdb.tdbloader --tdb=run/configuration/service2.ttl testDir/test_dataset.ttl

INFO  -- Start triples data phase
INFO  ** Load into triples table with existing data
INFO  -- Start quads data phase
INFO  ** Load empty quads table
INFO  Load: testDir/test_dataset.ttl -- 2015/11/13 12:46:22 CET
INFO  -- Finish triples data phase
INFO  ** Data: 57 triples loaded in 0,29 seconds [Rate: 193,22 per second]
INFO  -- Finish quads data phase
INFO  -- Start triples index phase
INFO  -- Finish triples index phase
INFO  -- Finish triples load
INFO  ** Completed: 57 triples loaded in 0,33 seconds [Rate: 172,21 per second]
INFO  -- Finish quads load

and

java -cp /home/honza/.apache-jena-fuseki-2.3.0/fuseki-server.jar jena.textindexer --desc=run/configuration/service2.ttl

WARN  Values stored but langField not set. Returned values will not have language tag or datatype.

After that, the server runs properly and I can see the graph, but it contains no data.

My config for this service is below (I don’t know whether it is right to keep the service and DB config in one file; for me it works better at the moment, since splitting them throws errors):

@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix :        <#> .

[] rdf:type fuseki:Server 
.

<#service2> rdf:type fuseki:Service ;
  rdfs:label                        "TDB/text service" ;
  fuseki:name                       "test" ;       # http://host:port/ds
  fuseki:serviceQuery               "sparql" ;   # SPARQL query service
  fuseki:serviceQuery               "query" ;    # SPARQL query service (alt name)
  fuseki:serviceUpdate              "update" ;   # SPARQL update service
  fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
  fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
  # A separate read-only graph store endpoint:
  fuseki:serviceReadGraphStore      "get" ;      # SPARQL Graph store protocol (read only)
  fuseki:dataset                    :text_dataset 
.

[] ja:loadClass   "org.apache.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

[] ja:loadClass "org.apache.jena.query.text.TextQuery" .

text:TextIndexLucene rdfs:subClassOf  text:TextIndex .
:text_dataset rdf:type text:TextDataset ;
  text:dataset <#test> ;
  text:index <#indexLucene> .

score 1 · Answer 1 · answered Nov 14 '15 at 01:10

Firstly, you haven't actually defined a Lucene index explicitly, so what you likely get is a transient in-memory index that is thrown away every time your application stops. At a minimum you need the following in your configuration:

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> .

Here <file:path/to/index/> points to the directory where you want your text index to be stored.

Secondly, you haven't told the text search how the Lucene index is structured. Even if you have created your index separately from your external files, you still need to define in your configuration how Jena should use and access that index.

From the documentation you need to define an entity map:

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .

The comments in the example from the documentation hopefully describe things fairly well. The text:entityField property specifies the field in your index that stores the URI associated with a piece of indexed data, i.e. it provides the means to link text index hits back to the RDF in your triple store. The text:defaultField property specifies the field containing the indexed data, i.e. the field that the text search will actually search.

The optional text:map shown here can be used to further customise what fields are searched and allow you to index multiple pieces of content in different fields and then write queries that search your text index in different ways.
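
As a rough sketch of what such a multi-field map could look like, the entity map below keeps rdfs:label in the default field and adds a second field for rdfs:comment. The field name "comment" and the choice of rdfs:comment are assumptions made purely for illustration; they do not come from the question's data.

```turtle
# Sketch only: the field "comment" and the predicate rdfs:comment
# are illustrative assumptions, not part of the original configuration.
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;      # field holding the subject URI
    text:defaultField     "text" ;     # searched when a query names no property
    text:map (
         [ text:field "text"    ; text:predicate rdfs:label ]
         [ text:field "comment" ; text:predicate rdfs:comment ]
         ) .
```

A query can then pick a field by naming the mapped predicate in the text:query argument list, for example ?s text:query (rdfs:comment 'word'); when no predicate is given, the default field is searched.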

Once you have an appropriately defined entity map you need to link it to your index configuration like so:

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> ;
    text:entityMap <#entMap> .

With this in place you should actually be able to get results from your index.
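
Putting the pieces together, the dataset part of the configuration might look roughly like the sketch below. It reuses the names from the question (:text_dataset, <#test>, <#indexLucene>) and the prefixes declared there. The <#test> TDB dataset definition and both directory paths are placeholders assumed for illustration, since the question's excerpt does not show them.

```turtle
# Sketch only: the paths and the <#test> definition are assumed placeholders.
:text_dataset rdf:type text:TextDataset ;
    text:dataset <#test> ;              # the underlying RDF store
    text:index   <#indexLucene> .       # the Lucene index defined below

# Underlying TDB dataset (location is a placeholder)
<#test> rdf:type tdb:DatasetTDB ;
    tdb:location "path/to/tdb" .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:path/to/index/> ;
    text:entityMap <#entMap> .

# Mapping in the index
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
```

Data loaded straight into TDB with tdbloader bypasses the text dataset, so the jena.textindexer step from the question would still need to be run afterwards to populate the index.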

Thanks a lot, this makes sense now! I will try and report. However, I am confused with the indexing of documents aside the metadata set. Is the way of pointing at them with a relative link ok? Could they be xhtml or is it better to structure books like another separate RDF datasets? I would like to chunk them per pages or paragraphs. — Honza Hejzl, Nov 14 '15 at 10:40
It works! I must ensure full-text search works as well. As for indexing external content, I am starting to guess it will need much more, like Solr or so... — Honza Hejzl, Nov 14 '15 at 12:10
