3

I need help solving a problem and to better understand the mechanics of Neo4j. The text below is long because I have tried to detail my problem and my attempts as much as possible.

To introduce the problem, take the simple structure below as an example. Each node is labelled as a Point and has 2 attributes: point_id and num, and point_id is used as a node representative.

Simple graph with labelled nodes

CREATE (a:Point {point_id: 1, num: 1}), 
(b:Point {point_id: 2, num: 1}), 
(c:Point {point_id: 3, num: 1}), 
(d:Point {point_id: 4, num: 2}), 
(e:Point {point_id: 5, num: 2}), 
(f:Point {point_id: 6, num: 1}), 
(g:Point {point_id: 7, num: 2}), 
(h:Point {point_id: 8, num: 2}), 
(i:Point {point_id: 9, num: 1}), 
(a)-[:NEXT]->(b),
(a)-[:NEXT]->(c),
(c)-[:NEXT]->(d),
(b)-[:NEXT]->(e),
(b)-[:NEXT]->(f),
(f)-[:NEXT]->(g),
(f)-[:NEXT]->(i),
(g)-[:NEXT]->(h),
(h)-[:NEXT]->(i);

Let's say I want to select all paths/nodes in paths starting from Point with point_id = 1, where all nodes in a path need to satisfy the same filtering condtion (like num = 1). In the above graph, the result would be (point_id as a node representative):

Query result

The query below return the expected result:

MATCH p=((a:Point {point_id: 1})-[:NEXT*]->(b:Point))
WHERE ALL (x IN NODES(p) WHERE x.num = 1)
RETURN p

Now let's consider my real environment: I have a graph representing a road network with approximately 2 million nodes and 2.8 million relationships (or twice that amount if represented in a bidirectional way). Each node has 3 attributes with the following weighted distribution: 0 (30%), 1 (30%), 2 (20%), 3 (14%), 4 (5%) and 5 (1%). Any attribute can be used as a filter with any possible value.

The node structure is:

(:Intersection {
intersection_id: int, 
num_hotels: int, 
num_restaurants: int, 
num_gas_stations: int
})

The relationships are labelled as [:CONNECTED_TO]. Nodes represent intersections/endpoints and the relationships represent roads. The attribute intersection_id is indexed.

I've tried to solve the problem exemplified above for some time but didn't succeed. Without specifying a maximum depth for traversing, Neo4j's memory usage explodes and the query runs indefinitely until I cancel the operation or end the process (sometimes I need to do this because the system nearly freezes due to the lack of available memory). I understand this behavior because query complexity increases exponentially with each new depth level accessed.

However, the operation is still quite costly even if I set a maximum depth level, such as 15 levels. My graph has nodes whose queries of up to 10 to 15 levels starting from them can involve both high amounts of nodes, such as 1/4 of the database, or relatively low amounts, such as few thousands of nodes (<2k). The mentioned behavior happens in both cases.

I'm using the values ​​with 20 or 30% of distribution as a filter, since it's difficult to map the values with 5% or 1% of distribution due to the size of the graph. I tried to filter with the attribute num_hotels with and without an index.

Below are some of the queries I've tried:

MATCH p=((a:Intersection {Intersection: 562818})-[:CONNECTED_TO*]->(b:Intersection))
WHERE ALL (x IN NODE(p) WHERE x.num_hotels = 0)
RETURN p

MATCH path=(a:Intersection {intersection_id: 562818})-[:CONNECTED_TO*]->(b:Intersection {num_hotels: 0})
WHERE ALL(x IN NODES(path) WHERE SINGLE(y IN NODES(path) WHERE y = x))
RETURN p;

For some cases, where the number of nodes involved in the traversing was low (300-600), I obtained plausible results, but queries didn't always execute normally. Sometimes transactions often seemed to freeze, so I had to end and start them again to have a result, and this was not always guaranteed.

I would like tips to solve the problem, as well as some explanation about the behavior of Neo4j in this type of operation. The impression I've had so far is that Neo4j is looking for all the paths and only then applying the filter regardless of how I organize the query.

Neo4j version: 3.3.1
OS: Linux Mint Cinnamon 18.2
Memory: 6GB, about 4.5GB available for the tests.

Bruno Peres
  • 15,845
  • 5
  • 53
  • 89
  • 1
    You can use EXPLAIN on the query to see how the planner will execute it. Pay attention to what is filtered after your var-length expansions, that will show what will be filtered after-the-fact vs checked during expansion. Usually ALL() and NONE() check during expansion, but ALL(... WHERE SINGLE()) typically is filtered after. If you want to enforce that each node is only visited once in a path (and no filtering on properties), check out `apoc.path.expandConfig()` in [APOC Procedures](https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_expand_paths) using 'NODE_PATH' uniqueness. – InverseFalcon Jan 09 '18 at 11:13

0 Answers0