Neo4j scalability and indexing

Question

An argument in favor of graph dbms with native storage over relational dbms made by neo4j (also in the neo4j graph databases book) is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).

Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and given the id of A I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding) considering I am not limiting my query to 1 node only hence all nodes need to be checked for matching. HOWEVER, when I index the queried node property, the execution time for the same query, is relatively constant.

Figure shows execution time by database node size before and after indexing. Orange plot is O(N) reference line, while the blue plot is the observed execution times.

Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).

As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):

MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc

UPDATE / ADDITIONAL INFORMATION

This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n).". This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is identical to a hash list performance(?). Not sure why using any other index the complexity is O(n log n) if even using a hash list the worst case is O(n).

Downvoting is easy to do; commenting why would be more helpful to formulate the question better. — dter, Nov 14 '17 at 13:50
The Neo4j book misunderstands & misrepresents the RM (relational model). One of its misconceptions is that the RM has anything to do with implementation--relations are an abstraction. Anyway their "index" is a straw man. The benefit of having multiple implementation structures available is improved querying when you *don't* want to just scan a bunch of values. Ultimately, they're just saying that they are optimized for certain use cases for which RDBMSs aren't. The tail wags the dog. https://stackoverflow.com/a/28799902/3404097 — philipxy, Nov 15 '17 at 11:29
@philipxy the idea of the test was to get an idea of how execution time varies with number of nodes - as a trend overall more than absolute values - hence why architecture and system was kept constant in this case. The linearity of the scalability made me think: whats the advantage of index free adjacency? If an indexed key value at worst is already linear. That’s why I ran the indexed tests to see the difference. This gets almost O(1) - which is the best performance of a hash list. I don’t see how the index free adjacency is being advantageous. — dter, Nov 15 '17 at 12:52
I guess it boils down to the question how does a traditional indexed rdbms perform in comparison to the linear non indexed performance obtained here? (This would hint on an answer to the question: whats the performance impact of index free adjacency) — dter, Nov 15 '17 at 12:54

score 2 · Accepted Answer · answered Nov 15 '17 at 20:32

From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating, is that when you find A, the additional time to find C is negligible, and the question of whether to use an index or not, is only to find the initial queried node A.

To find A without an index it takes O(n), because it has to scan through all the nodes in the database, but with an index, it's effectively like a hashlist, and takes O(1) (no clue why the book says O(n log n) either).

Beyond that, finding the adjacent nodes are not that hard for Neo4j, because they are linked to A, whereas in RM the linkage is not as explicit - thus a join, which is expensive, and then scan/filter is required. So to truly see the advantage, one should compare the performance of graph DBs and RM DBs, by varying the depth of the relations/links. It would also be interesting to see how a query would perform when the neighbours of the entity nodes are increased (ie., the graph becomes denser) - does Neo4j rely on the graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.

Neo4j scalability and indexing

1 Answers1