I am working on a project where I need to perform many merge operations of subgraphs onto a remote graph. Some elements of a subgraph may already exist in the remote graph. I am using py2neo v3 and Neo4j.
I tried using both the create and merge functions of py2neo, and with both I get surprisingly bad performance. Even more surprising, the time taken to merge the subgraph seems to grow quadratically with both the number of nodes and the number of relationships! When the subgraph is too big, the transaction hangs. One thing I should note: I checked, and it is not py2neo that generates a number of Cypher statements growing quadratically with the size of the subgraph. So if something is wrong, it is either in how I am using these technologies or in Neo4j's implementation. I also looked at the query plans for the queries generated by py2neo, and did not find any answer there as to why the query times grow so dramatically, but don't take my word for it since I am relatively uninitiated.
I could hardly find any relevant information online, so I conducted a proper benchmark comparing performance as a function of the number of nodes and the topology of the subgraph, depending on whether I use the merge or create operation and whether I use unique constraints or not. I include below some of the results I got for graphs with a "linear" topology, meaning that the number of relationships is roughly the same as the number of nodes (it doesn't grow quadratically). In my benchmark, I use 5 different labels for nodes and 5 different types for relationships, assigned randomly, and I reuse 30% of nodes that already exist in the remote graph. The nodes I create have a single property that acts as an identifier, and I report performance depending on whether I add a unique constraint on this property or not. All the merge operations are run within a single transaction.
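For concreteness, here is a minimal sketch of the kind of workload I am timing (the label names, relationship types, and connection settings below are placeholders; the full benchmark is in the gist linked further down):

    from random import choice, random

    from py2neo import Graph, Node, Relationship

    NODE_LABELS = ["L0", "L1", "L2", "L3", "L4"]  # placeholder label names
    REL_TYPES = ["R0", "R1", "R2", "R3", "R4"]    # placeholder relationship types

    def build_linear_subgraph(n_nodes, reuse_ratio=0.3):
        """Build a chain of n_nodes nodes, so #relationships ~ #nodes.
        Roughly reuse_ratio of the identifiers collide with nodes assumed
        to already exist in the remote graph."""
        nodes = [
            Node(choice(NODE_LABELS),
                 # low ids overlap with pre-existing nodes, high ids are new
                 id=i if random() < reuse_ratio else n_nodes + i)
            for i in range(n_nodes)
        ]
        subgraph = nodes[0]
        for a, b in zip(nodes, nodes[1:]):
            # the union operator combines nodes and relationships into one Subgraph
            subgraph = subgraph | Relationship(a, choice(REL_TYPES), b)
        return subgraph

    graph = Graph("bolt://localhost:7687", password="neo4j")  # placeholder settings
    for label in NODE_LABELS:
        # only in the "with unique constraint" runs (skip if it already exists)
        graph.schema.create_uniqueness_constraint(label, "id")

    tx = graph.begin()
    tx.merge(build_linear_subgraph(1000))  # or tx.create(...) for the create runs
    tx.commit()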
[Plot: query times for graphs with a linear topology, as a function of the number of nodes, using the py2neo create function]
[Plot: query times for graphs with a linear topology, as a function of the number of nodes, using the py2neo merge function]
As you can see, the time taken seems to grow quadratically with the number of nodes (and relationships).
The question I am having a hard time answering is whether I am doing something wrong, or failing to do something I should, or whether this is the kind of performance to expect from Neo4j for this kind of operation. Regardless, it seems that what I could do to alleviate the issue is to never merge big subgraphs all at once, but rather merge the nodes batch by batch and then the relationships. This could and would work, but I want to get to the bottom of this, if someone has any recommendation or insight to share.
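Something along these lines is what I have in mind as a workaround (a sketch only; the helper and batch size are hypothetical):

    def merge_in_batches(graph, nodes, relationships, batch_size=500):
        """Hypothetical workaround: merge all nodes first, then all
        relationships, in fixed-size batches instead of one big subgraph."""
        for items in (nodes, relationships):
            for i in range(0, len(items), batch_size):
                tx = graph.begin()
                for item in items[i:i + batch_size]:
                    # a single Node or Relationship is itself a valid subgraph
                    tx.merge(item)
                tx.commit()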
Edit
Here is a link to a gist to reproduce the results above, and others: https://gist.github.com/alreadytaikeune/6be006f0a338502524552a9765e79af6
Edit 2
Following Michael Hunger's questions:
In the code I shared, I tried to write a formatter for the neo4j.bolt logs in order to capture the queries sent to the server. I don't have a systematic way to generate query plans for them, however.
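This is roughly how I hook into the neo4j.bolt logger (a sketch; the format string is arbitrary):

    import logging

    # attach a handler to the "neo4j.bolt" logger used by py2neo v3's driver,
    # so every statement sent over Bolt is echoed to stderr
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    bolt_logger = logging.getLogger("neo4j.bolt")
    bolt_logger.setLevel(logging.DEBUG)
    bolt_logger.addHandler(handler)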
I did not try without Docker, and I don't have an SSD. However, considering the size I allocate to the JVM and the size of the graph I am handling, everything should fit in RAM.
I use the latest Docker image for neo4j, so the corresponding version seems to be 3.3.5.
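If it matters, the exact server version can be double-checked from py2neo via the dbms.components procedure (available in Neo4j 3.x):

    from py2neo import Graph

    graph = Graph("bolt://localhost:7687", password="neo4j")  # placeholder settings
    # dbms.components() reports the kernel version of the running server
    graph.run("CALL dbms.components()").dump()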