I'm researching native graph databases and triple stores (RDF stores) for our use. We're currently focused on MarkLogic for the triple store, and Neo4j, and maybe OrientDB, for the native graph DB.
Part A of this question lays out the context: I'm investigating a major distinction between these two types of DBs, and I'd like verification of that picture, in particular whether I'm missing anything.
In the second part, Part B, I'm asking how much of what I outline in Part A each of these DBs actually does.
Part A:
AFAIK, a major distinction is that triple stores store relationships (or rather, edges) based on the relationship itself. So a triple store is a "bag" of edges, each with specific, well-designed attributes that reflect the semantics of that relationship. Native graph DBs, on the other hand, store the graph structure: nodes and the links between them, along with whatever attributes you'd like to define on those nodes and links.
I think the following two descriptions set out the two extremes for a fair view of these systems. They really are extremes; I'm pretty sure the DBs out there do more than either one of them.
1.) Bag of edges (triple store): each subject-predicate-object triple, say (sourceNode, edge, destNode), is stored as a single record, forming a triple store entry. The triple store is indexed on each of these 3 columns, so when I need a list of people who have friends that live in Australia, I (or rather, the triple store engine) quickly get the “friends” relationships and, among them, search for the ones that have a source or dest node where the node is a person and has the property “lives in Australia”.
2.) Native graph: nodes with labels and properties, and the links in between. To find people "who have friends that live in Australia", I first find the nodes that are labeled as "person", then I search the relationship list (which is a linked list (?)) of each such node, and go from there. This is 2 searches, one on nodes and the second on the relationships of each node, as opposed to one search on the relationships (triples) in a triple store.
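To make the two extremes above concrete, here is a minimal Python sketch of how I picture them. Everything in it (the `TripleStore` and `PropertyGraph` classes, the toy data, the function names) is made up for illustration; it is not how MarkLogic, Neo4j or OrientDB are actually implemented, just the mental model I'm asking about.

```python
from collections import defaultdict

# Toy data invented for this sketch.
TRIPLES = [
    ("alice", "friend", "bob"),
    ("bob", "friend", "carol"),
    ("carol", "friend", "alice"),
    ("alice", "livesIn", "Canada"),
    ("bob", "livesIn", "Australia"),
    ("carol", "livesIn", "Australia"),
]


class TripleStore:
    """Extreme 1: a bag of (subject, predicate, object) records, indexed per column."""

    def __init__(self, triples):
        self.all = set(triples)
        self.index = {"s": defaultdict(set), "p": defaultdict(set), "o": defaultdict(set)}
        for s, p, o in self.all:
            self.index["s"][s].add((s, p, o))
            self.index["p"][p].add((s, p, o))
            self.index["o"][o].add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        hits = self.all
        for column, value in (("s", s), ("p", p), ("o", o)):
            if value is not None:
                hits = hits & self.index[column].get(value, set())
        return hits


def friends_in_australia_triples(store):
    # Index lookups on the edges themselves; no node is visited first.
    in_australia = {s for s, _, _ in store.match(p="livesIn", o="Australia")}
    return {s for s, _, o in store.match(p="friend") if o in in_australia}


class PropertyGraph:
    """Extreme 2: nodes with labels and properties, each holding its own relationship list."""

    def __init__(self):
        self.nodes = {}                    # node id -> {"props": dict, "out": [(rel_type, target)]}
        self.by_label = defaultdict(set)   # label -> node ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"props": props, "out": []}
        self.by_label[label].add(node_id)

    def add_rel(self, src, rel_type, dst):
        self.nodes[src]["out"].append((rel_type, dst))


def friends_in_australia_graph(graph):
    result = set()
    # Search 1: find the "Person" nodes. Search 2: walk each node's relationship list.
    for person in graph.by_label.get("Person", ()):
        for rel_type, target in graph.nodes[person]["out"]:
            if rel_type == "friend" and graph.nodes[target]["props"].get("livesIn") == "Australia":
                result.add(person)
                break
    return result


def build_graph():
    g = PropertyGraph()
    g.add_node("alice", "Person", livesIn="Canada")
    g.add_node("bob", "Person", livesIn="Australia")
    g.add_node("carol", "Person", livesIn="Australia")
    g.add_rel("alice", "friend", "bob")
    g.add_rel("bob", "friend", "carol")
    g.add_rel("carol", "friend", "alice")
    return g
```

Both functions return the same set of people on this toy data; the difference I care about is only where the lookup starts, on the edge index or on the nodes.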
One thing I keep seeing in blogs about the pros and cons of triple stores vs. native graph DBs is that triple stores score on queries because of their indexing: the relationships can be accessed quickly. In a native graph DB, relationships are accessed through the nodes they are incident to. (I'm aware that, by this very same token, native graph DBs have the advantage of retaining the graph structure, so that graph algorithms and solutions can be implemented more easily and run faster.)
However, the lack of indexing does not necessarily have to be a shortcoming of a native graph DB if it allows indexing of nodes and/or relationships based on their properties and/or their labels.
If it allows labeling of nodes and indexes on those labels, I as the developer can take a subgraph of the overall graph and go from there. Such a query on a restricted domain would be much faster.
If it allows labeling of relationships, queries "revolving around” relationships, like the “list of people who have friends that live in Australia” above, can execute faster, because the query won't traverse links from the nodes and look up the properties of nodes, but will instead look up and access the links directly.
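What I have in mind with an index on relationship labels is roughly the following Python sketch. The `IndexedGraph` class and its `rels_by_type` index are hypothetical names of mine, not an API of any of the three products; the point is only that the query starts from the edges of one type instead of from the nodes.

```python
from collections import defaultdict


class IndexedGraph:
    """A node-centric graph that also keeps a secondary index of edges by type (hypothetical)."""

    def __init__(self):
        self.props = {}                          # node id -> property dict
        self.out = defaultdict(list)             # node id -> [(rel_type, target id)]
        self.rels_by_type = defaultdict(list)    # rel_type -> [(source id, target id)]

    def add_node(self, node_id, **props):
        self.props[node_id] = props

    def add_rel(self, src, rel_type, dst):
        # The edge is stored on its source node AND in the per-type index.
        self.out[src].append((rel_type, dst))
        self.rels_by_type[rel_type].append((src, dst))


def people_with_friends_in(graph, country):
    # Start from the "friend" edges directly; nodes are only touched to check a property.
    return {
        src
        for src, dst in graph.rels_by_type.get("friend", [])
        if graph.props[dst].get("livesIn") == country
    }


g = IndexedGraph()
g.add_node("alice", livesIn="Canada")
g.add_node("bob", livesIn="Australia")
g.add_node("carol", livesIn="Australia")
g.add_rel("alice", "friend", "bob")
g.add_rel("bob", "friend", "carol")
g.add_rel("carol", "friend", "alice")
```

Here `people_with_friends_in(g, "Australia")` never scans the node set, which is the behaviour I'd expect from an edge index if one exists.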
Part B:
I am wondering how much of this MarkLogic, Neo4j and OrientDB are doing.
I skimmed through Chapter 6 of this book on Neo4j and haven't seen anything about a direct search on an index of edges (relationships). Have I missed anything?
If I did miss it and Neo4j has such indexing on edges, how come triple stores have the major advantage of fast queries over native graph DBs?
TIA.
//----------------------
EDIT:
Note: I've seen Graph DBs vs. Document DBs vs. Triplestores among some other useful discussions.