Improve performance removing TinkerGraph vertices

Question

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:

g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();

Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.

Using Java, tinkerpop/tinkergraph 3.4.6.

Currently it is taking about 45 minutes to drop all those vertices.

I did a java profiling and the results shows 73% of time spent in the TinkerVertex.remove method, and the rest in the ExpandableStepIterator.next

Is there something like "bulk drop"? Would a JanusGraph or other graph provider be much faster?

score 2 · Accepted Answer · answered Jun 19 '21 at 11:05

It's unlikely that there are graphs that are much faster than TinkerGraph as TinkerGraph is a pure in-memory implementation. You might find one that is more efficient using that memory like OverflowDB which originally forked from TinkerGraph but I don't know that it will make this particular query go faster.

TinkerGraph, nor any graph I know of, has a filtered "bulk drop" operation.

The "not" style global query here is simple expensive as you're having to touch the a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention if you were experiencing a lot of GC as you were doing your profile. Perhaps that is an issue? If so, I would try to adjust your JVM memory configurations - maybe you just need a larger -Xmx value or something simple like that.

From a query perspective, you could try to invert not() portion of the traversal to positively find the things you want to remove. It might lead to a less concise query to read but could perhaps speed things up, but on the other hand you are still trying to delete 50% of your data so the cost may not be just in finding the vertices to get rid of.

One other thought would be try to parallelize the drop(). You might hit concurrency errors so you could need a retry strategy but you could consider taking the Iterator of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating calls to each (or batches of) Vertex.remove() to a separate worker thread.

Memory allocation was fine, changing provider seems too intrusive for low chance of improvement, query refactoring the same (75% of time was on actual delete). That left me with parallelizing it. From 45+- minutes down to 8+- minutes! (the query changed a bit as well) — GriffoGoes, Jun 21 '21 at 21:17

score 1 · Answer 2 · answered Jun 21 '21 at 21:34

Based on accepted answer, a simple parallelization improved enough that this operation is no longer the most critical time-wise

For future reference, this:

g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();

is now something like this:

ExecutorService executor = Executors.newFixedThreadPool(4);
int iterator = 0;
final int batchsize = 10000;
Long count = g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).count().next();
List<Callable<Object>> callableList = new ArrayList<Callable<Object>>();

// splitting current set into tasks to be executed in para
while (iterator * batchsize < count) {
    final Set<Object> vSet =  g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).skip(iterator * batchsize).limit(batchsize).id().toSet();
    callableList.add(() -> g.V(vSet).drop().iterate());
    iterator++;
}
List<Future<Object>> results = executor.invokeAll(callableList);

After some tests, I decided to keep the iteration in a single thread. That way the distributed tasks are really independent of each other (e.g.: one task completing won't affect other task query).

Keep in mind that the actual removal is still single thread, as the vertex node map modification is behind concurrent access locks.

The effect is that increasing threads won't get better results (personally tried 8). And based on some thread dumps, even 4 might be too much (there is always 1 or more thread in a waiting status) - although I did get a dump with 3 threads running!

Improve performance removing TinkerGraph vertices

2 Answers2

Linked