5

I am working on a project where I need to perform many merge operations of subgraphs onto a remote graph. Some elements of the subgraph may already exist in the remote graph. I am using py2neo v3, and neo4j.

I tried using the create and merge function of neo4j, and with both I get surprisingly bad performances. Even more surprising, the time taken to merge the subgraph seems to grow quadratically both with the number of nodes and the number of relationships! When the subgraph is too big, the transaction hangs. One thing I should say is that I checked and it is not py2neo that generates a number of cypher statements that grows quadratically with the size of the subgraph. So if something is wrong, it is either with how I am using those technologies, or with neo4j's implementation. I also tried looking at the query plans for the queries generated by py2neo, and did not find any answer in that as to why the query times grow so dramatically, but don't take my word for it since I am relatively non-initiated.

I could hardly find any relevant information on-line so I tried conducting a proper benchmarking where I compared the performances in function of the number of nodes, and topology of the subgraph, depending on whether of I use the merge or create operation and whether I use unique constraints or not. I include below some of the results I got for graphs with "linear" topology, meaning that the number of relationships is roughly the same as the number of nodes (it doesn't grow quadratically). In my benchmark, I use 5 different types of labels for nodes and relationships that I assign randomly, and reuse 30% of nodes that already exist in the remote graph. The nodes I create have only one property that act as an identifier, and I report the performances depending on whether I add a unique constraint on this property or not. All the merging operations are run within a single transaction.

Query times for graphs with a linear topology in function of the number of nodes, using py2neo create function Query times for graphs with a linear topology in function of the number of nodes, using py2neo create function

Query times for graphs with a linear topology in function of the number of nodes, using py2neo merge function

Query times for graphs with a linear topology in function of the number of nodes, using py2neo merge function

As you can see, the time taken seems to grow quadratically with the number of nodes (and relationships).

The question I am having a hard time answering is whether I do something wrong, or don't do something that I should, or if it the kind of performances we should expect from neo4j for these kind of operations. Regardless, it seems that what I could do to alleviate this performance issue is to never try merging big subgraphs all at once, but rather start by merging the nodes batch by batch then the relationships. This could and would work, but I want to get at the bottom of this, if someone has any recommendation or insight to share.

Edit

Here is a list to a gist to reproduce the results above, and others. https://gist.github.com/alreadytaikeune/6be006f0a338502524552a9765e79af6

Edit 2

Following Michael Hunger's questions:

In the code I shared, I tried to write a formatter for neo4j.bolt logs, in order to capture the queries that are sent to the server. I don't have a systematic way to generate query plans for them however.

I did not try without docker and I don't have an SSD. However, considering the size I allocate for the jvm and size of the graph I am handling, everything should fit in RAM.

I use the latest docker image for neo4j, so the corresponding version seems to be 3.3.5

Anis
  • 2,984
  • 17
  • 21
  • cam you share some code ? – logisima Apr 25 '18 at 14:56
  • Yes sure, let me package it a little bit – Anis Apr 25 '18 at 14:57
  • @logisima I added a link to a gist. You will need docker to run it, I use it to deploy the neo4j server. – Anis Apr 25 '18 at 15:06
  • do you have a document with a query log + (visual) query plans? – Michael Hunger Apr 26 '18 at 09:01
  • did you try it without docker, or with an externally mounted data volume on an ssd? – Michael Hunger Apr 26 '18 at 09:03
  • which neo4j version are you using? – Michael Hunger Apr 26 '18 at 09:04
  • @MichaelHunger thank you for your answers, I updated my question. – Anis Apr 26 '18 at 12:36
  • I didn't have this problem, but I have lots of problems using py2neo, and your sample code contains almost the only examples of actually using it. I was able to solve multiple problems by looking at your sample code! – Aaron Bramson Nov 27 '18 at 08:27
  • @AaronBramson I'm glad it served you :). You say you didn't experience this quadratic effect when performing merge operations? – Anis Nov 27 '18 at 17:15
  • @Anis I am not saying that, in my case I am adding each node and edge to the graph individually without using transaction because doing otherwise crashed Python. The problems I was able to solve were much more basic use of py2neo. There is no real documentation or examples of how to perform basic actions, so seeing your working example taught me many of those basic things. – Aaron Bramson Nov 29 '18 at 02:32
  • Now that I've got my code running, I am noticing a severe increase in the time required to merge into my graph (using py2neo v4). At the beginning it took less than 1 minute to process 1000 lines of data, and at row 700,000 it is taking on average about 15 minutes. Each row typically adds 1-2 nodes and 1-2 edges, but has to match 3-6 existing nodes in a tree structure before knowing where to add them. It's slow enough to be unusable (requiring a few weeks to load 1.5M rows of data). – Aaron Bramson Dec 05 '18 at 05:34
  • Maybe your problem can be solved using indexes, or primary keys. – Anis Dec 05 '18 at 09:04

1 Answers1

0

Unfortunately, the merge routine (and a few others) in v3 are a little naive and don't scale well. I have alternatives planned for py2neo v4 that build much more efficient queries instead of (in the case of merge) arbitrarily long sequences of MERGE statements. Version 4 should be released at some point next month (May 2018).

Nigel Small
  • 4,475
  • 1
  • 17
  • 15
  • I understand, however I somewhat came to dismiss py2neo as the source of the behavior I put forward, since, the number of statements spawned by the create and merge statements grow linearly with the number of nodes, and relationships. So I don't really think py2neo is at the source of this kind of quadratic behavior. Besides considering the time scales we are dealing with ~ 1 second, I also don't think hydration/dehydration operations would cause that since they should operate at a much lower time scale ~millisecond. Does that make sense? – Anis Apr 26 '18 at 12:53