I'm looking for some general advice on whether to rewrite our application code to be less naive, or to abandon neo4j for another data storage model. This question is not purely subjective: it concerns the specific, correct usage of the neo4j driver in Python and why it performs the way it does with my code.
Background:
My team and I have been using neo4j to store graph-friendly data that is initially stored in Python objects. Originally, we were advised by a local/in-house expert to use neo4j, as it seemed to fit our data storage and manipulation/querying requirements. The data are always specific instances of a set of carefully-constructed ontologies. For example (pseudo-data):
Superclass1 -contains-> SubclassA
Superclass1 -implements-> SubclassB
Superclass1 -isAssociatedWith-> Superclass2
SubclassB -hasColor-> Color1
Color1 -hasLabel-> string::"Red"
...and so on, to create some rather involved and verbose hierarchies.
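To make the shape of these data concrete, here is a minimal sketch of the pseudo-data above as plain Python tuples. This is not our actual RDFLib code; the names `TRIPLES`, `build_adjacency`, and `descendants` are made up for illustration only.

```python
from collections import defaultdict

# The pseudo-data above as (subject, predicate, object) triples.
TRIPLES = [
    ("Superclass1", "contains", "SubclassA"),
    ("Superclass1", "implements", "SubclassB"),
    ("Superclass1", "isAssociatedWith", "Superclass2"),
    ("SubclassB", "hasColor", "Color1"),
    ("Color1", "hasLabel", "Red"),
]


def build_adjacency(triples):
    """Group outgoing edges by subject: {subject: [(predicate, object), ...]}."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return dict(adjacency)


def descendants(adjacency, start):
    """Return the set of everything reachable from `start` by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _predicate, child in adjacency.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


adjacency = build_adjacency(TRIPLES)
reachable = descendants(adjacency, "Superclass1")
```

Even this tiny example is three levels deep, which is why a real data set quickly turns into the involved hierarchies described here.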
For prototyping, we were storing these data as sequences of grammatical triples (subject->verb/predicate->object) using RDFLib, and using RDFLib's graph-generator to construct a graph.
Now, since this information is just a complicated hierarchy, we store it in some custom Python objects. We also do this to provide an easy API to other devs who need to interface with our core service. We hand them a Python library that is our object model and let them populate it with data, or we populate it and hand it to them for easy reading, and they do what they want with it.
To store these objects permanently, and to hopefully accelerate the writing and reading (querying/filtering) of these data, we've built custom object-mapping code that utilizes the official neo4j python driver to write and read these Python objects, recursively, to/from a neo4j database.
The Problem:
For large and complicated data sets (e.g. 15k+ nodes and 15k+ relations), the object relational mapping (ORM) portion of our code is too slow and scales poorly. But neither I nor my colleague is an expert in databases or neo4j. I think we're being naive about how to accomplish this ORM. We began to wonder whether it even makes sense to use neo4j, when a more traditional ORM (e.g. SQLAlchemy) might just be a better choice.
For example, the ORM commit algorithm we have now is a recursive function that commits an object like this (pseudo code):
def commit(obj):
    for childstr in obj:  # For each child attribute name
        child = getattr(obj, childstr)  # Get the actual object
        if isinstance(child, <our object base type>):  # Open transaction, make nodes and relationship
            with session.begin_transaction() as tx:
                <construct Cypher query with:
                    MERGE obj (make object node)
                    MERGE child (make its child node)
                    MERGE obj-[]->child (create relation)
                >
                tx.run(<All 3 merges>)
            commit(child)  # Recursively write the child and its children to neo4j
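To show what this recursion actually does, here is a minimal runnable sketch of the same algorithm with the neo4j session swapped out for a fake one that only counts transactions. Everything here is a stand-in for illustration: `Node` is a hypothetical substitute for our object base type, `RecordingSession` is not part of the neo4j driver, and the Cypher string is only representative of the three MERGEs (it is never sent to a database).

```python
class Node:
    """Hypothetical stand-in for our object base type."""

    def __init__(self, name, **children):
        self.name = name
        self.children = children  # relation name -> child (Node or plain value)


class RecordingSession:
    """Fake session: records queries instead of talking to neo4j."""

    def __init__(self):
        self.transactions = 0
        self.queries = []

    def run_in_new_transaction(self, query, **params):
        # Stands in for "open a transaction, run the query, close it".
        self.transactions += 1
        self.queries.append((query, params))


# Representative only: the three MERGEs from the pseudocode in one statement.
MERGE_EDGE = (
    "MERGE (p:Node {name: $parent}) "
    "MERGE (c:Node {name: $child}) "
    "MERGE (p)-[:REL {type: $rel}]->(c)"
)


def commit(obj, session):
    """Recursively write obj and its children, one transaction per relation."""
    for rel, child in obj.children.items():
        if isinstance(child, Node):
            session.run_in_new_transaction(
                MERGE_EDGE, parent=obj.name, child=child.name, rel=rel
            )
            commit(child, session)


root = Node(
    "Superclass1",
    contains=Node("SubclassA"),
    implements=Node("SubclassB", hasColor=Node("Color1", hasLabel="Red")),
    isAssociatedWith=Node("Superclass2"),
)
session = RecordingSession()
commit(root, session)
print(session.transactions)  # 4: one transaction per parent-child relation
```

So the number of transactions grows linearly with the number of relations; with 15k+ relations, that would be 15k+ separate transactions, each re-merging a parent node that was already written.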
Is it naive to do it like this? Would an OGM library like Py2neo's OGM be better, despite ours being customized? I've seen this and similar questions that recommend this or that OGM method, but in this article, it says not to use OGMs at all.
Must we really just implement every method and benchmark each for performance? It seems like there must be some best practices (other than using the batch IMPORT, which doesn't fit our use cases). We've read through articles like those linked and seen the various tips on writing better queries, but it seems better to step back and examine the case more generally before attempting to optimize code line by line, although it's clear that we can improve the ORM algorithm to some degree.
Does it make sense to write and read large, deep hierarchical objects to/from neo4j using a recursive strategy like this? Is there something in Cypher, or the neo4j drivers that we're missing? Or is it better to use something like Py2neo's OGM? Is it best to just abandon neo4j altogether? The benefits of neo4j and Cypher are difficult to ignore, and our data does seem to fit well in a graph. Thanks.