
I'm looking for some general advice on how to either re-write application code to be non-naive, or whether to abandon neo4j for another data storage model. This is not only "subjective", as it relates significantly to specific, correct usage of the neo4j driver in Python and why it performs the way it does with my code.

Background:

My team and I have been using neo4j to store graph-friendly data that is initially stored in Python objects. Originally, we were advised by a local/in-house expert to use neo4j, as it seemed to fit our data storage and manipulation/querying requirements. The data are always specific instances of a set of carefully-constructed ontologies. For example (pseudo-data):

Superclass1 -contains-> SubclassA
Superclass1 -implements-> SubclassB
Superclass1 -isAssociatedWith-> Superclass2
SubclassB -hasColor-> Color1
Color1 -hasLabel-> string::"Red"

...and so on, to create some rather involved and verbose hierarchies.

For prototyping, we were storing these data as sequences of grammatical triples (subject->verb/predicate->object) using RDFLib, and using RDFLib's graph-generator to construct a graph.

Now, since this information is just a complicated hierarchy, we just store it in some custom Python objects. We also do this in order to provide an easy API to others devs that need to interface with our core service. We hand them a Python library that is our Object model, and let them populate it with data, or, we populate it and hand it to them for easy reading, and they do what they want with it.

To store these objects permanently, and to hopefully accelerate the writing and reading (querying/filtering) of these data, we've built custom object-mapping code that utilizes the official neo4j python driver to write and read these Python objects, recursively, to/from a neo4j database.

The Problem:

For large and complicated data sets (e.g. 15k+ nodes and 15k+ relations), the object relational mapping (ORM) portion of our code is too slow and scales poorly. Neither I nor my colleague is an expert in databases or neo4j, and I think we're being naive about how to accomplish this ORM. We began to wonder whether it even makes sense to use neo4j, when a more traditional ORM (e.g. SQLAlchemy) might just be a better choice.

For example, the ORM commit algorithm we have now is a recursive function that commits an object like this (pseudo code):

def commit(obj):
    for childstr in obj:                    # For each child attribute name
        child = getattr(obj, childstr)      # Get the actual child object

        if isinstance(child, <our object base type>):
            # Open a transaction, make both nodes and the relationship
            with session.begin_transaction() as tx:
                <construct Cypher query with:
                MERGE object            (make object node)
                MERGE child             (make its child node)
                MERGE object-[]->child  (create relation)
                >
                tx.run(<All 3 merges>)

            commit(child)                   # Recursively write the child and its children to neo4j

Is it naive to do it like this? Would an OGM library like Py2neo's OGM be better, despite ours being customized? I've seen this and similar questions that recommend this or that OGM method, but in this article, it says not to use OGMs at all.

Must we really just implement every method and benchmark for performance? It seems like there must be some best practices (other than using the batch IMPORT, which doesn't fit our use cases). We've read through articles like those linked, and seen the various tips on writing better queries, but it seems better to step back and examine the case more generally before attempting to optimize code line by line, although it's clear that we can improve the ORM algorithm to some degree.

Does it make sense to write and read large, deep hierarchical objects to/from neo4j using a recursive strategy like this? Is there something in Cypher, or the neo4j drivers that we're missing? Or is it better to use something like Py2neo's OGM? Is it best to just abandon neo4j altogether? The benefits of neo4j and Cypher are difficult to ignore, and our data does seem to fit well in a graph. Thanks.

turanc

2 Answers

It's hard to know without looking at all the code and knowing the class hierarchy, but at the moment I'd hazard a guess that your code is slow in the OGM bit because every relationship is created in its own transaction. So you're doing a huge number of transactions for a larger graph which is going to slow things down.

I'd suggest that for an initial import, where you're creating every class/object rather than just adding a new one or editing the relationships for one class, you use your class inspectors to build a graph representation of the data, and then use Cypher to construct it in Neo4J in far fewer transactions. Using some basic topological graph theory, you could then optimise it further by reducing the number of lookups you need to do.
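As a rough sketch of that idea (not the asker's actual object model): the hierarchy is walked once in Python to collect plain node and edge rows, and those rows are then sent as parameters to two UNWIND queries inside a single transaction. The `Item` class, the `uid` property, the `:Item` label and the generic `:RELATED` relationship are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for the question's object base type."""
    uid: str
    children: dict = field(default_factory=dict)  # relation name -> child Item


def flatten(root):
    """Walk the hierarchy once and return (nodes, edges) as parameter rows."""
    nodes, edges, stack, seen = [], [], [root], set()
    while stack:
        obj = stack.pop()
        if obj.uid in seen:  # shared children are only emitted once
            continue
        seen.add(obj.uid)
        nodes.append({"uid": obj.uid})
        for rel, child in obj.children.items():
            edges.append({"src": obj.uid, "dst": child.uid, "rel": rel})
            stack.append(child)
    return nodes, edges


# Assumed schema: every node is an :Item keyed by uid, and the relation name
# is stored as a property on a generic :RELATED relationship.
NODE_QUERY = "UNWIND $rows AS row MERGE (n:Item {uid: row.uid})"
EDGE_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (a:Item {uid: row.src}), (b:Item {uid: row.dst}) "
    "MERGE (a)-[:RELATED {name: row.rel}]->(b)"
)


def write_graph(tx, nodes, edges):
    """Issue two tx.run calls in one transaction, whatever the graph size."""
    tx.run(NODE_QUERY, rows=nodes)
    tx.run(EDGE_QUERY, rows=edges)
```

With the official driver, `write_graph` could be handed to a managed transaction, for example `session.execute_write(write_graph, nodes, edges)` in driver 5.x (`session.write_transaction` in older releases). The number of round trips then no longer grows with the number of relationships.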

You can create a NetworkX MultiDiGraph in your python code to model the structure of your classes. From there on in there are a few different strategies to put the data into Neo4J - I also just found this but have no idea about whether it works or how efficient it is.
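A minimal sketch of that modelling step, assuming the data is available as (subject, predicate, object) triples like the pseudo-data in the question. The helper name `build_graph` and the row layout (`uid`, `src`, `dst`, `rel`) are illustrative choices, not part of NetworkX.

```python
import networkx as nx


def build_graph(triples):
    """Build a MultiDiGraph from (subject, predicate, object) triples."""
    g = nx.MultiDiGraph()
    for subj, pred, obj in triples:
        # add_edge creates missing nodes; the predicate becomes the edge key,
        # so two different relations between the same pair stay distinct.
        g.add_edge(subj, obj, key=pred)
    return g


triples = [
    ("Superclass1", "contains", "SubclassA"),
    ("Superclass1", "implements", "SubclassB"),
    ("Superclass1", "isAssociatedWith", "Superclass2"),
    ("SubclassB", "hasColor", "Color1"),
]
graph = build_graph(triples)

# Knowing whether the structure has cycles helps pick an import strategy.
is_acyclic = nx.is_directed_acyclic_graph(graph)

# Plain rows, ready to be passed as query parameters to whichever store is used.
node_rows = [{"uid": n} for n in graph.nodes]
edge_rows = [{"src": u, "dst": v, "rel": k} for u, v, k in graph.edges(keys=True)]
```

Keeping the graph in NetworkX first means the storage code only ever sees `node_rows` and `edge_rows`, independent of how the Python objects are laid out.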

The most efficient way to import your graph will depend on its topology, and on whether it is cyclical or not. Some options are below.

1. Create the Graph in Two Sets of Queries

Run one query for every node label to create every node, and then another to create every edge between every combination of node labels (the efficiency of this will depend on how many different node labels you're using).
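One way this could look in Python is sketched below. Cypher does not accept a label or relationship type as an ordinary `$parameter`, so the rows are grouped first and one query string is generated per group. The row keys (`uid`, `label`, `src`, `dst`, `rel`) are assumptions, and the labels are interpolated into the query text, so they should only ever come from your own ontology, never from untrusted input.

```python
from collections import defaultdict


def node_queries(nodes):
    """Group node rows by label; return one (query, rows) pair per label."""
    by_label = defaultdict(list)
    for node in nodes:
        by_label[node["label"]].append({"uid": node["uid"]})
    return [
        (f"UNWIND $rows AS row MERGE (n:`{label}` {{uid: row.uid}})", rows)
        for label, rows in by_label.items()
    ]


def edge_queries(edges, label_of):
    """Group edges by (source label, relation type, target label)."""
    by_combo = defaultdict(list)
    for edge in edges:
        combo = (label_of[edge["src"]], edge["rel"], label_of[edge["dst"]])
        by_combo[combo].append({"src": edge["src"], "dst": edge["dst"]})
    return [
        (
            f"UNWIND $rows AS row "
            f"MATCH (a:`{src}` {{uid: row.src}}), (b:`{dst}` {{uid: row.dst}}) "
            f"MERGE (a)-[:`{rel}`]->(b)",
            rows,
        )
        for (src, rel, dst), rows in by_combo.items()
    ]
```

Each returned pair can be executed as `tx.run(query, rows=rows)`. The total number of queries is the number of node labels plus the number of label/relation combinations, rather than the number of relationships.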

2. Starting from the topologically highest or lowest point in the graph, create the graph as a series of paths

If you have lots of different edge labels and node labels, this might involve writing a lot of Cypher logic combining UNWIND and FOREACH (_ IN CASE WHEN r.label = 'SomeLabel' THEN [1] ELSE [] END | CREATE (n:SomeLabel {node_unique_id: x})-> ... ). If the graph is very hierarchical, though, you could also use Python to keep track of which nodes already have all their lower nodes and relationships created, and then use that knowledge to limit the size of the paths that get sent to Neo4J in a query.
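The bookkeeping part of this can be sketched with the standard library's `graphlib` (Python 3.9+), assuming the hierarchy is acyclic and is available as a mapping from each node to its direct children. The function name `creation_batches` is made up for this example.

```python
from graphlib import TopologicalSorter


def creation_batches(children_of):
    """Yield groups of nodes whose children have all been created already."""
    # TopologicalSorter expects node -> dependencies; here a node "depends"
    # on its children, so leaves come out first.
    sorter = TopologicalSorter(children_of)
    sorter.prepare()  # raises graphlib.CycleError if the hierarchy has a cycle
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


hierarchy = {
    "Superclass1": ["SubclassA", "SubclassB"],
    "SubclassB": ["Color1"],
}
batches = list(creation_batches(hierarchy))
```

For this small hierarchy the batches are `["Color1", "SubclassA"]`, then `["SubclassB"]`, then `["Superclass1"]`. Each batch could be written in one query, and every node in it can safely be linked to children that already exist.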

3. Use APOC to import the whole graph

Another option, which may or may not fit your use case and may or may not be more performant would be to export the graph to GraphML using NetworkX and then use the APOC GraphML import tool.
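A small sketch of the export half, assuming a NetworkX graph already exists. `nx.write_graphml` is the NetworkX call; the file name, the example nodes and the empty APOC config map are placeholders, and the import call assumes APOC is installed on the server with file import enabled.

```python
import os
import tempfile

import networkx as nx


def export_graphml(graph, path):
    """Write the graph to a GraphML file and return the path."""
    # GraphML only stores scalar attribute values (str, int, float, bool).
    nx.write_graphml(graph, path)
    return path


graph = nx.MultiDiGraph()
graph.add_node("Color1", label="Red")
graph.add_edge("SubclassB", "Color1", key="hasColor", rel="hasColor")

path = export_graphml(graph, os.path.join(tempfile.mkdtemp(), "ontology.graphml"))

# Cypher to run on the Neo4j side once the file is readable by the server
# (assumption: the APOC plugin is available and allowed to read files).
IMPORT_QUERY = "CALL apoc.import.graphml($file, {})"
```

Whether this beats batched Cypher will depend on the data, so it is worth timing both on a representative graph.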

Again, it's hard to offer a precise solution without seeing all your data, but I hope this is somewhat useful as a steer in the right direction! Happy to help / answer any other questions based on more data.

Dom Weldon
  • Updated to add a new possible solution using GraphML – Dom Weldon Jun 21 '18 at 17:38
  • Thanks for this. Our data is indeed rather hierarchical, and it seems like you're saying it's best to just try to optimize the ORM I've already made. How costly are multiple transactions compared to large transactions with incredibly long query strings? It seems like right now, building a flattened ORM would imply doing the latter. – turanc Jun 22 '18 at 22:06
  • Again it's a hard thing to answer properly without a specific use case, but in general Cypher is a very capable language and its interactions with Neo4J are optimized really well. For example, if you're iterating through a hierarchy to create lots of nodes/relationships in a very defined pattern, it will likely be orders of magnitude more efficient to do that in Cypher than it will to do it through repeated requests and transactions (perhaps over a network) to create individual edges, and that's not including the inherent benefits of transactionality in the first place... – Dom Weldon Jun 23 '18 at 00:10
  • Also, regarding adjusting the existing O(R/G)M or doing something different, I'd suggest the best route is to abstract your data from your implementation into your database. If you can store the data in NetworkX, then you're in a stronger position to put it into any data store (although sounds like Neo4J is probably your best bet). Write code to create your data, then write code to store it! That way it's easier to test, and if you decide to change to something else in future you only have to change one smaller module that does one thing rather than one big module that does too many things. – Dom Weldon Jun 23 '18 at 00:14

There is a lot going on here so I'll try to address this in smaller questions

Would an OGM library like Py2neo's OGM be better

With any ORM/OGM library, the reality is that you can always get better performance by bypassing it and delving into the belly of the beast. That is not really the ORM's entire job though. An ORM is meant to save you time and effort by making relatively efficient DB use easy.

So it depends. If you want the best performance, skip the ORM and invest your time working at as low a level as you can (*this requires advanced low-level knowledge of the beast you are working with, and a lot of your time). Otherwise, an ORM library is usually your best bet.

Our code is too slow, and scales poorly

Databases are complex. If at all possible, I would recommend bringing someone on board to be a company-wide database admin/expert. (This is harder when you don't already have one to vet that new hires actually know what they are talking about.)

Assuming that is not an option, here are some things to consider.

  • IO is expensive. Especially over the network. Minimize data that has to be sent in either direction. (This is why you page return results. Only return the data you need, as you actually need it)
    • Caveat to that: creating request connections is very expensive. Minimize calls to the DB. (Have fun balancing the two ^_^) (Note: ORMs usually have built-in mechanisms to only commit what has changed)
  • Get to the data you want fast. Create indexes in the database to vastly improve fetch speed. The more unique and consistent the id is, the better.
    • Caveat, indexes have to be updated on writes that alter a value in them. So indexes reduce write speed and eat more memory to gain read speed. Minimize indexes.
  • Transactions are a memory operation. Committing a transaction is a disk IO operation. This is why batch jobs are far more efficient.
    • Caveat, Memory isn't infinite. Keep your jobs a reasonable size.
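The last two points can be illustrated with a small batching helper: rows are split into fixed-size chunks and each chunk is committed in its own transaction, so there are far fewer commits than rows while no single transaction grows without bound. The default `batch_size` of 1000 is an arbitrary starting point to tune, not a recommendation from the driver, and `session` is assumed to be a session from the official Python driver.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_in_batches(session, query, rows, batch_size=1000):
    """Commit one transaction per batch rather than one per row."""
    committed = 0
    for batch in chunked(rows, batch_size):
        with session.begin_transaction() as tx:
            tx.run(query, rows=batch)  # the query is expected to UNWIND $rows
            tx.commit()
        committed += 1
    return committed
```

A call such as `write_in_batches(session, "UNWIND $rows AS row MERGE (n:Item {uid: row.uid})", rows)` would then write 15k rows in 15 transactions instead of 15k (the `:Item` label and `uid` key are again hypothetical).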

As you can probably tell, scaling DB operations to production levels is not fun. It's too easy to burn yourself over-optimizing on any axis, and these are just surface-level oversimplifications.

For prototyping, we were storing these data as sequences of grammatical triples

Less a question and more a statement, but different types of databases have different strengths and weaknesses. Schema-less DBs are more specialized for cache stores; graph DBs are specialized for querying based on relationships (edges); relational DBs are specialized for fetching/updating records (tables); and triplestores are more specialized for, well, triples (RDF). (There are more types, etc.)

I mention this because it sounds like your data might be mostly "write once, read many". In this case, you probably actually should be using a triplestore. You can use any DB type for anything, but picking the best DB requires you to know how you use your data, and how that use can possibly evolve.

Must we really just implement every method and benchmark for performance?

Well, this is part of why stored procedures are so important. ORMs help abstract this part, and having an in-house domain expert would really help. It could just be that you are pushing the limits of what one machine can do. Maybe you just need to upgrade to a cluster, or maybe you have horrible code inefficiencies that have you touching a node 10k times in one save operation when no (or one) value changed. To be honest though, benchmarking doesn't do much unless you know what you are looking for. For example, the difference between 5 hours and 0.5 seconds could be as simple as creating one index.
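To make the index point concrete: a MERGE on a property with no index has to scan the nodes carrying that label, which adds up quickly when it is repeated thousands of times. The sketch below only builds the index statements and runs them once before loading. The `CREATE INDEX ... IF NOT EXISTS FOR ... ON ...` form is the syntax of recent Neo4j versions (older 3.x releases used `CREATE INDEX ON :Label(prop)`), and the label/property pairs are hypothetical and assumed to be simple identifiers.

```python
def index_statement(label, prop):
    """Build a CREATE INDEX statement for a property used as a MERGE key."""
    name = f"{label}_{prop}_index".lower()
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def ensure_indexes(session, keys):
    """Run one index statement per (label, property) pair before loading."""
    for label, prop in keys:
        session.run(index_statement(label, prop))
```

Calling `ensure_indexes(session, [("Item", "uid")])` once before a bulk write is cheap, and comparing load times with and without it is a quick way to check whether a missing index is the bottleneck.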

(To be fair, while buying bigger and better database servers/clusters may be the inefficient solution, it is sometimes the most cost-effective one compared to the salary of one database admin. So, again, it depends on your priorities. And I'm sure your boss would probably prioritize differently from what you'd like.)


TL;DR

You should hire a domain expert to help you.

If that is not an option, go to the bookstore (or Google), pick up Databases 4 Dummies (or a hands-on online database tutorial), and become the domain expert yourself. (Which you can then use to boost your worth to the company.)

If you don't have time for that, probably your only saving grace would be to just upgrade your hardware to solve the problem with brute force. (*As long as growth isn't exponential)

Tezra