I'm using Jena TDB to maintain call-dependency structures of various software projects. After statically analyzing large projects, I may need to add 100k statements to a dedicated Jena Model within the TDB-backed dataset, or even millions in extreme cases.
The issue
Adding 300k statements takes about 11 minutes, so adding 3M would take far too long. Is there a faster way to add this many statements, or another technology altogether?
What I've tried
- Added all statements using model.add(List&lt;Statement&gt; stmts). This throws a java.lang.OutOfMemoryError and hogs the dataset because of the acquired write lock.
- Added all statements in chunks of e.g. 1000, committing and releasing the lock in between (see the sketch after this list). This works, but takes forever as stated above, presumably due to the overhead of transactional write-ahead logging.
- Added the statements to a temporary, fresh, TDB-backed model non-transactionally, then replaced the old model with the new one. RAM usage rises exorbitantly and slows down the whole system.
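For completeness, the chunked variant looks roughly like the following simplified sketch. The chunk size of 1000 is arbitrary, and workspace and model are the same wrapper and model as in the code further down.

List<Statement> statements = ...;
int chunkSize = 1000;

for (int i = 0; i < statements.size(); i += chunkSize) {
    List<Statement> chunk =
            statements.subList(i, Math.min(i + chunkSize, statements.size()));

    // One transaction per chunk: the write lock is released quickly,
    // but every commit pays the write-ahead-logging overhead
    workspace.beginTransaction(ReadWrite.WRITE); // forwards to dataset.begin(ReadWrite rw)
    try {
        model.add(chunk);
        workspace.commitTransaction();           // forwards to dataset.commit()
    } catch (Exception e) {
        e.printStackTrace();
        workspace.abortTransaction();            // forwards to dataset.abort()
    }
}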
Side Questions
- Are there alternatives to Jena/RDF you would recommend for this use case?
- Will Jena be scalable with respect to distributed file systems/computing?
Other information
I'm using transactions, which is probably a major slowdown factor due to heavy I/O. I can't avoid that, though, since it's "once transactional, always transactional":
A TDB-backed dataset can be used non-transactionally but once used in a transaction, it must be used transactionally after that.
Thanks for any tips, dearly appreciate it.
Code & Tests
Upon @AndyS's recommendation I retried adding all statements in a single transaction like so:
List<Statement> statements = ...;

// Print statistics
System.out.println("Statement count: " + statements.size());

// Log the start of the operation to measure its duration
long start = System.currentTimeMillis();

// Add all statements in one transaction
workspace.beginTransaction(ReadWrite.WRITE); // forwards to dataset.begin(ReadWrite rw)
try {
    model.add(statements);
    workspace.commitTransaction();           // forwards to dataset.commit()
} catch (Exception e) {
    e.printStackTrace();
    workspace.abortTransaction();            // forwards to dataset.abort()
}

// Check how long the operation took
double durationS = (System.currentTimeMillis() - start) / 1000.0;
System.out.println("Transaction took " + durationS + " seconds.");
This is the output:
Statement count: 3233481
The thread this transaction runs in crashes with the following message in the debugger:
Daemon Thread [ForkJoinPool-1-worker-1] (Suspended (exception OutOfMemoryError))
Bumping the heap space to 4 GB circumvents this issue, but the add still hogs the dataset for almost two minutes:
Statement count: 3233481
Transaction took 108.682 seconds.
Using TDBLoader will most likely behave the same way (as indicated here), and aside from that it does not support transactions, which I would like to have in order to prevent dataset corruption.
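If loading while the dataset is offline turns out to be acceptable after all, one option I'm considering is to stream the statements to an N-Triples file and bulk-load that file with the tdbloader command-line tool while nothing else has the dataset open. This is an untested sketch assuming a recent Apache Jena (org.apache.jena packages); the file name and TDB location are placeholders.

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;

import org.apache.jena.rdf.model.Statement;
import org.apache.jena.riot.RDFFormat;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

// Stream the statements to an N-Triples file without building another in-memory model
List<Statement> statements = ...;
try (OutputStream out = new FileOutputStream("statements.nt")) {
    StreamRDF writer = StreamRDFWriter.getWriterStream(out, RDFFormat.NTRIPLES);
    writer.start();
    for (Statement stmt : statements) {
        writer.triple(stmt.asTriple());
    }
    writer.finish();
}

// Then, with no other process touching the TDB location:
//   tdbloader --loc /path/to/tdb statements.nt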