Complex Cypher Traversal with Math

Question

I'm relatively new to Neo4j and graph databases, so bear with me.

I want to traverse through our Neo4j database in the most efficient way possible. While traversing, I need to do some math along the way on edges and their values. At scale, our database will have possibly millions of nodes and edges, so I'm very concerned with efficiency.

My database has nodes with people where some people are "marked". And there are transactions between these people. See the images below. In my algorithm, I basically take any person and see how much money they've received from every marked person.

Up until this point, I've been using the Neo4j python driver and neomodel to perform these traversals.

To do this, I've created an algorithm which is basically a modified version of recursive depth-first traversal. I recursively traverse deeper and deeper through a node's senders until I can't anymore. When I encounter a "marked person" (e.g. a criminal whose money I want to track), I add a record for them.

As the recursion goes back towards the source, I repeatedly multiply the money sources by the fraction of how much the given node received from its sender. For example, when the recursion returns back to John, I first multiply all of Sally's sources by what fraction of Sally's money was sent to John, which in this case is (3/17) since Sally received 17 dollars and sent 3 dollars to John. I'll then do the same for Frank's sources. I multiply each of his sources by (2/11) since Frank received 11 dollars and John received 2 dollars from Frank.

Here is the python code I wrote to perform this algorithm:

def get_sources(node):
    source_record = {}
    for sender in node.senders:
        # retrieve the edge between these two nodes
        transaction = node.senders.relationship(sender)
        amount_from_sender = transaction.value
        sender_total_received = sender.total_received()
        if isinstance(sender, MarkedPerson):  # Base Case
            source_record[sender.name] = amount_from_sender
        if len(sender.senders) > 0:  # Recursive Case
            sender_sources = get_sources(sender)
            for source_name, source_value in sender_sources.items():
                # find what fraction of the sender's money they sent to 'node', then
                # multiply this by how much money the sender has of this source to find
                # how much money 'node' has from the source
                amount_from_source = (amount_from_sender / sender_total_received) * source_value
                if source_name in source_record:
                    source_record[source_name] += amount_from_source
                else:
                    source_record[source_name] = amount_from_source

    return source_record

Here are some examples of what results it gives:

Result when Querying John: {'Bill': 2.310160427807487, 'Rob': 2.6898395721925135}

Result for Querying John: {'Bill': 2.310160427807487, 'Rob': 2.6898395721925135, 'Sal': 2.6898395721925135}

So I have the following questions:

Is a traversal of this type possible using cypher queries? From my initial investigations, it seems not to be.
I've seen people using gremlin to perform similarly complex graph queries. Would this be something worth looking into?
Are there any other tools for dealing with computations with similarly complex data models which would be better suited for our needs?
Is there another well-known graph algorithm I could use or adapt to perform this same task?

Any thoughts or comments would be greatly appreciated. Thanks!

Hayden Flinner · Answer 1 · 2022-12-06T05:55:29.243

It seems representable with variable length patterns 'reduce' and case-when to emulate a ternary for 'is this a criminal or not'. It seems gremlin might have fallen by the wayside as the planner improved; this seems like the sort of thing that the query planner ought not to struggle with, since you in the end need to touch most or all nodes anyway. But assuming your graph has no cycles, this also seems like Python should be fast enough, at least for batched analyses and cached ad-hoc queries against a precomputed set. Millions of nodes ought to easily fit in memory given a reasonable in-memory representation.

This answer might get you started.

Complex Cypher Traversal with Math

1 Answers1