I'm using OrientDB 2.0.0 to test its handling of bulk data loading. For sample data, I'm using the GDELT dataset from Google's GDELT Project (free download). I'm loading a total of ~80M vertices, each with 8 properties, into the V class of a blank graph database using the Java API.
The data is in a single tab-delimited text file (US-ASCII), so I'm simply reading the text file from top to bottom. I configured the database using OIntentMassiveInsert(), and set the transaction size to 25,000 records per commit.
I'm using an 8-core machine with 32 GB of RAM and an SSD, so the hardware should not be a factor. I'm running Windows 7 Pro with Java 8u31.
The first 20M (or so) records went in quite quickly, at under 2 seconds per batch of 25,000. I was very encouraged.
However, as the process has continued to run, the insert rate has slowed significantly. The slowing appears to be pretty linear. Here are some sample lines from my output log:
Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000
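For reference, the rate in each log line is just the batch size divided by the elapsed seconds, truncated to a whole number. Here is a minimal sketch of that calculation; the class and method names are made up for illustration and are not taken from my loader:

```java
import java.util.Locale;

final class BatchRate {
    // Records per second for one committed batch, truncated toward zero.
    static long recordsPerSecond(int batchSize, double elapsedSeconds) {
        return (long) (batchSize / elapsedSeconds);
    }

    // Builds a log line in the same shape as the sample output.
    static String logLine(int batchSize, long elapsedNanos, long total) {
        double seconds = elapsedNanos / 1_000_000_000.0;
        return String.format(Locale.ROOT,
                "Committed %d GDELT Event records to OrientDB in %s seconds"
                        + " at a rate of %d records per second. Total = %d",
                batchSize, seconds, recordsPerSecond(batchSize, seconds), total);
    }

    public static void main(String[] args) {
        // Elapsed time would normally come from System.nanoTime() around the commit.
        System.out.println(logLine(25_000, 4_099_891_890L, 31_350_000L));
    }
}
```

Plugging in the four samples gives 6097, 2653, 1573 and 545 records per second, so the throughput at 50M records is less than a tenth of what it was at 31M.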
As the operation has progressed, the memory usage has been pretty constant, but the CPU usage by OrientDB has climbed higher and higher, rising in step with the commit duration. In the beginning, the OrientDB Java process was using about 5% CPU. It is now up to about 90%, with the utilization being nicely distributed across all 8 cores.
Should I break the load operation down into several sequential connections, or is it really a function of how the vertex data is being managed internally and it would not matter if I stopped the process and continued inserting where I left off?
Thanks.
[Update] The process eventually died with the error: java.lang.OutOfMemoryError: GC overhead limit exceeded
All commits were successfully processed, and I ended up with a little over 51M records. I'll look at restructuring the loader to break the one giant file into many smaller files (say, 1M records each) and treat each file as a separate load.
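Here is a minimal sketch of the splitting step, using only the standard library. The class name, the chunk file naming and the output directory are my own choices for illustration. It only splits the file by line count; it does not touch OrientDB, and it does not treat a header row specially.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FileSplitter {
    // Splits input into files of at most linesPerChunk lines each and
    // returns the chunk paths in order.
    static List<Path> split(Path input, Path outDir, int linesPerChunk) throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        // The source file is US-ASCII; a non-ASCII byte would raise an exception here.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.US_ASCII)) {
            BufferedWriter writer = null;
            try {
                String line;
                long count = 0;
                while ((line = reader.readLine()) != null) {
                    if (count % linesPerChunk == 0) {
                        // Start a new chunk file.
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outDir.resolve(
                                String.format(Locale.ROOT, "chunk-%05d.tsv", chunks.size()));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.US_ASCII);
                    }
                    writer.write(line);
                    writer.newLine();
                    count++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return chunks;
    }
}
```

Calling FileSplitter.split(input, outDir, 1_000_000) would give about 80 chunk files for the full dataset, each of which could then be loaded over its own connection.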
Once that completes, I will attempt to take the flat Vertex list and add some Edges. Any suggestions on how to do that in the context of a bulk insert, where vertex IDs have not yet been assigned? Thanks.
[Update 2] I'm using the Graph API. Here is the code:
// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraph txGraph = factory.getTx();
// Iterate row by row over the file.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    try {
        Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGIN A TRANSACTION
        for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
            v.setProperty(headerFieldsReduced[i], fields[i]);
        }
        // Commit every so often to balance performance and transaction size
        if (++counter % commitPoint == 0) {
            txGraph.commit();
        }
    } catch( Exception e ) {
        txGraph.rollback();
    }
}
[Update 3 - 2015-02-08] Problem solved!
If I had read the documentation more carefully, I would have seen that using transactions in a bulk load is the wrong strategy. I switched to using the "NoTx" graph and to adding the vertex properties in bulk, and it worked like a champ without slowing down over time or pegging the CPU.
I started with 52M vertexes in the database, and added 19M more in 22 minutes at a rate of just over 14,000 vertexes per second, with each vertex having 16 properties.
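// Property map, reused for every row
Map<String,Object> props = new HashMap<String,Object>();
// Open the OrientDB database instance in non-transactional mode
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
graph = factory.getNoTx();
// Iterate row by row over the file, as in the original loader.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    OrientVertex v = graph.addVertex(null);
    props.clear(); // don't carry values over from a previous, longer row
    for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
        props.put(headerFieldsReduced[i], fields[i]);
    }
    // Set all properties in one call instead of one call per property
    v.setProperties(props);
}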
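The row-to-map step can be pulled out and checked on its own, without a database. This is a minimal sketch with hypothetical names (RowProps, toProps). It mirrors the loop above, pairing header names with fields and stopping at whichever runs out first, but builds a fresh map per row so nothing carries over between rows:

```java
import java.util.HashMap;
import java.util.Map;

final class RowProps {
    // Pairs each header name with the field in the same column of a
    // tab-delimited line; extra headers or extra fields are ignored.
    static Map<String, Object> toProps(String[] header, String line) {
        String[] fields = line.split("\t");
        Map<String, Object> props = new HashMap<String, Object>();
        for (int i = 0; i < header.length && i < fields.length; i++) {
            props.put(header[i], fields[i]);
        }
        return props;
    }
}
```

The resulting map is what gets passed to v.setProperties(props). Note that split("\t") drops trailing empty fields, so a row that ends in empty columns simply yields fewer properties.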