I'm relatively new to Neo4j and graph databases, so bear with me.
I want to traverse through our Neo4j database in the most efficient way possible. While traversing, I need to do some math along the way on edges and their values. At scale, our database will have possibly millions of nodes and edges, so I'm very concerned with efficiency.
My database has nodes with people where some people are "marked". And there are transactions between these people. See the images below. In my algorithm, I basically take any person and see how much money they've received from every marked person.
Up until this point, I've been using the Neo4j python driver and neomodel to perform these traversals.
To do this, I've created an algorithm which is basically a modified version of recursive depth-first traversal. I recursively traverse deeper and deeper through a node's senders until I can't anymore. When I encounter a "marked person" (e.g. a criminal whose money I want to track), I add a record for them.
As the recursion goes back towards the source, I repeatedly multiply the money sources by the fraction of how much the given node received from its sender. For example, when the recursion returns back to John, I first multiply all of Sally's sources by what fraction of Sally's money was sent to John, which in this case is (3/17) since Sally received 17 dollars and sent 3 dollars to John. I'll then do the same for Frank's sources. I multiply each of his sources by (2/11) since Frank received 11 dollars and John received 2 dollars from Frank.
Here is the python code I wrote to perform this algorithm:
def get_sources(node):
source_record = {}
for sender in node.senders:
# retrieve the edge between these two nodes
transaction = node.senders.relationship(sender)
amount_from_sender = transaction.value
sender_total_received = sender.total_received()
if isinstance(sender, MarkedPerson): # Base Case
source_record[sender.name] = amount_from_sender
if len(sender.senders) > 0: # Recursive Case
sender_sources = get_sources(sender)
for source_name, source_value in sender_sources.items():
# find what fraction of the sender's money they sent to 'node', then
# multiply this by how much money the sender has of this source to find
# how much money 'node' has from the source
amount_from_source = (amount_from_sender / sender_total_received) * source_value
if source_name in source_record:
source_record[source_name] += amount_from_source
else:
source_record[source_name] = amount_from_source
return source_record
Here are some examples of what results it gives:
Result when Querying John: {'Bill': 2.310160427807487, 'Rob': 2.6898395721925135}
Result for Querying John: {'Bill': 2.310160427807487, 'Rob': 2.6898395721925135, 'Sal': 2.6898395721925135}
So I have the following questions:
- Is a traversal of this type possible using cypher queries? From my initial investigations, it seems not to be.
- I've seen people using gremlin to perform similarly complex graph queries. Would this be something worth looking into?
- Are there any other tools for dealing with computations with similarly complex data models which would be better suited for our needs?
- Is there another well-known graph algorithm I could use or adapt to perform this same task?
Any thoughts or comments would be greatly appreciated. Thanks!