
I'm using Jena TDB to maintain call-dependency structures of various software projects. After statically analyzing large projects, it may be the case that I need to add 100k statements to a dedicated Jena Model within the TDB-backed dataset, maybe even millions in extreme cases.

The issue

Adding 300k statements takes about 11 minutes. Imagine how long it would take to add 3M. I wonder if there's another approach to add this many statements, or another technology altogether.

What I've tried

  • Added all statements using model.add(List<Statement> stmts) - throws a java.lang.OutOfMemoryError and hogs the dataset due to the acquired write lock.
  • Added all statements in chunks of e.g. 1000, committing and releasing the lock in between. Works, but takes forever as stated above, presumably due to the overhead of transactional write-ahead logging (a sketch of this approach follows the list).
  • Added the statements to a temporary, fresh and TDB-backed model non-transactionally, then replaced the old model with the new one. RAM usage rises exorbitantly and slows down the whole system.
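
For reference, a minimal sketch of the chunked approach from the second bullet (the dataset variable, the statements list, and the chunk size of 1000 are illustrative):

import java.util.List;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Statement;

int chunkSize = 1000;
for (int i = 0; i < statements.size(); i += chunkSize) {
    List<Statement> chunk = statements.subList(i, Math.min(i + chunkSize, statements.size()));
    dataset.begin(ReadWrite.WRITE);   // acquire the write lock for this chunk only
    try {
        dataset.getDefaultModel().add(chunk);
        dataset.commit();             // flush to disk and release the lock between chunks
    } finally {
        dataset.end();
    }
}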

Side Questions

  • Are there alternatives to Jena/RDF you would recommend for this use case?
  • Will Jena be scalable w.r.t. distributed file systems/computing?

Other information

I'm using transactions, which is probably a major slowdown factor due to heavy I/O. I can't avoid that, though, since it's "once transactional, always transactional":

A TDB-backed dataset can be used non-transactionally but once used in a transaction, it must be used transactionally after that.

Thanks for any tips; dearly appreciated.


Code & Tests

Upon @AndyS's recommendation, I retried adding all statements in a single transaction like so:

List<Statement> statements = ...;

//Print statistics
System.out.println("Statement count: " + statements.size());

//Log the start of the operation to measure its duration
long start = System.currentTimeMillis();

//Add all statements in one transaction
workspace.beginTransaction(ReadWrite.WRITE); //forwards to dataset.begin(ReadWrite rw)
try {
    model.add(statements);
    workspace.commitTransaction(); //forwards to dataset.commit()
} catch (Exception e) {
    e.printStackTrace();
    workspace.abortTransaction(); //forwards to dataset.abort()
}

//Check how long the operation took
double durationS = (System.currentTimeMillis() - start) / 1000.0;
System.out.println("Transaction took " + durationS + " seconds.");

This is the output:

Statement count: 3233481

The thread this transaction runs in crashes with the following message in the debugger:

Daemon Thread [ForkJoinPool-1-worker-1] (Suspended (exception OutOfMemoryError))

Bumping the heap space to 4 GB circumvents this issue, but the transaction still hogs the dataset for almost two minutes:

Statement count: 3233481
Transaction took 108.682 seconds.

Using TDBLoader will most likely behave the same way (indicated here), but aside from that it does not support transactions, which I would like to have to prevent dataset corruption.

Double M
  • The `JDWP exit error` is some kind of platform system error - it is not Jena related. There are quite a few Google hits for this. – AndyS Nov 14 '16 at 10:58

3 Answers


If you are using transactions, use one transaction to cover the whole load of the 300k statements. 300k isn't usually very large (nor is 3M) unless it has many, many very large literals.

A single Model.add(Collection) should work.

Or to load from a file:

dataset.begin(ReadWrite.WRITE);
try {
    RDFDataMgr.read(dataset, FILENAME);
    dataset.commit();
} finally {
    dataset.end();
}

There is also a bulk loader for offline loading. It is a separate program, tdbloader.
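
A typical invocation, for reference (the database directory and data file are placeholders):

tdbloader --loc=/path/to/DB data.nt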

There isn't a Model.add(Collection) - there is a Model.add(List). Put that inside the transaction pattern:

dataset.begin(ReadWrite.WRITE);
try {
    dataset.getDefaultModel().add(...);
    dataset.commit();
} finally {
    dataset.end();
}

There is also a new transaction API in Jena 3.1.1: http://jena.apache.org/documentation/txn/txn.html
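
A minimal sketch with that API, assuming a Dataset dataset and the List<Statement> statements from the question (begin/commit/end are handled automatically):

import org.apache.jena.query.Dataset;
import org.apache.jena.system.Txn;

Txn.executeWrite(dataset, () -> dataset.getDefaultModel().add(statements));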

AndyS
  • Thanks a lot for your advice; I tried your approach again after increasing the heap size (see updated question), but it still takes too long for a single transaction (others are blocked for the duration). And yes, it's `List` not `Collection`, my bad. – Double M Nov 14 '16 at 16:08

Jena TDB insertion is costly because it creates a lot of indexes (more or less all combinations of graph, subject, predicate, and object). The emphasis is on quick data access, not quick data insertion.

I ended up using an SSD in order to get acceptable insertion times.

As for alternatives, I can point to:

  • RDF4J (previously known as Sesame), which lets you select the indexes you want in your database (see the sketch after this list).
  • Parliament (http://parliament.semwebcentral.org/), which is based on Berkeley DB as its NoSQL backend and seemed quite fast for insertion.
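
A minimal sketch of the RDF4J index selection mentioned above (the data directory and the index string "spoc,posc" are illustrative; fewer indexes speed up insertion at the cost of some query patterns):

import java.io.File;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.nativerdf.NativeStore;

// NativeStore takes a comma-separated list of triple indexes to maintain
SailRepository repo = new SailRepository(new NativeStore(new File("/data/rdf4j"), "spoc,posc"));
repo.initialize();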
daxid
  • RDF4J is an excellent hint, thank you very much. Do you by any chance know whether multiple application instances can access the same data store (e.g. using Cumulus RDF) simultaneously? – Double M Nov 18 '16 at 13:01
  • I can see on [https://code.google.com/archive/p/cumulusrdf/](https://code.google.com/archive/p/cumulusrdf/) that: "CumulusRDF comprises a SesameSail implementation, see CodeExamples wiki page." This is what you need to connect to the RDF4J quad-store. Cumulus RDF seems interesting, I will look further into it. – daxid Nov 22 '16 at 18:12
  • Been there, done that. It's definitely better than Jena because it allows for a distributed storage backend (like Apache Cassandra), but hasn't been maintained for a while due to a lack of funds. I'm currently switching to Titan with a DynamoDB backend (supports Cassandra, too). No RDF but Graph DB; For my purpose **much** more useful and immensely scalable. Thanks for your insights though, I might as well mark this one as the correct answer. – Double M Nov 22 '16 at 18:51

I had the same problem with remote Jena TDB and Fuseki. What I did was to POST (HTTP POST) the whole data as a file to the remote Jena data endpoint, which is:

http://FusekiIP:3030/yourdataset/data
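
A sketch of doing that POST from Java with Jena's DatasetAccessor, assuming model holds the statements to upload (the endpoint URL is the one above):

import org.apache.jena.query.DatasetAccessor;
import org.apache.jena.query.DatasetAccessorFactory;

DatasetAccessor accessor = DatasetAccessorFactory.createHTTP("http://FusekiIP:3030/yourdataset/data");
accessor.add(model); // HTTP POST: adds the model's statements to the default graph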

PeerNet