This is a somewhat abstract and general question. I'm interested in the inherent (as well as implementation-specific) properties of different approaches to persisting unstructured data that has both lots of internal references (graph-like) and lots of properties (JSON-like).
Since every tree is a graph, you can view graph DBs (e.g. Neo4j) as a superset of document DBs (e.g. MongoDB): a graph DB provides all the functionality of a document DB, but additionally allows cycles and has a native reference (edge) type, so you don't have to dereference foreign keys/ids manually. Is there some tipping point, as you add more references between your objects/resources, where you become better off with a graph DB than with a document store? Are there advantages to document DBs (storage space, performance?), or should you always go with a graph DB just in case you need more references later?
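To make the "manual dereferencing" point concrete, here is a toy sketch in plain Python (no real MongoDB/Neo4j APIs, made-up data): in the document style every hop is an explicit lookup by id, while in the graph style the relationship itself is the pointer.

```python
# Toy illustration only, not actual driver calls.

# Document-store style: references are just ids stored inside JSON-like docs.
users = {
    "u1": {"name": "Alice", "follows": ["u2", "u3"]},
    "u2": {"name": "Bob",   "follows": ["u3"]},
    "u3": {"name": "Carol", "follows": []},
}

def followed_names_doc(user_id):
    # Every hop is a lookup by foreign key (in a real DB: an extra query
    # or an application-side join).
    return [users[fid]["name"] for fid in users[user_id]["follows"]]

# Graph style: the edge is a first-class pointer; no id lookup per hop.
class Node:
    def __init__(self, name):
        self.name = name
        self.follows = []   # direct references to other Node objects

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.follows = [bob, carol]
bob.follows = [carol]

def followed_names_graph(node):
    return [n.name for n in node.follows]

print(followed_names_doc("u1"))      # ['Bob', 'Carol']
print(followed_names_graph(alice))   # ['Bob', 'Carol']
```

The tipping point I'm asking about is presumably where these application-side joins (extra queries per hop) start to dominate.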
Similarly, how do graph DBs and triplestores (e.g. RDF stores) compare? Property-graph DBs (where nodes and edges can carry properties) seem to be a superset of plain triplestores. So for which problems (if any) do triplestores actually perform better than, say, Neo4j? (One advantage of RDF stores is the standardized query language, SPARQL, although plenty of people dislike SPARQL and would therefore count it as a disadvantage.)
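Here is what I mean by "superset", again as a toy Python sketch with made-up identifiers (not real RDF or Cypher syntax): a property graph can hang attributes directly off an edge, whereas with plain subject-predicate-object triples you have to introduce an intermediate resource to say anything about the relationship itself.

```python
# Property-graph style: the edge itself carries properties.
edge = {
    "from": "alice",
    "to": "acme_corp",
    "type": "WORKS_FOR",
    "properties": {"since": 2019, "role": "engineer"},
}

# Plain triples: only subject-predicate-object facts, so attributes of the
# relationship need an intermediate node (here ':employment1').
triples = [
    ("alice",        ":hasEmployment", ":employment1"),
    (":employment1", ":employer",      "acme_corp"),
    (":employment1", ":since",         2019),
    (":employment1", ":role",          "engineer"),
]

# Same information, but the triple version needs an extra hop per edge
# property when querying, which is part of what I'd like to see quantified.
def employer_of(person):
    emp = next(o for s, p, o in triples if s == person and p == ":hasEmployment")
    return next(o for s, p, o in triples if s == emp and p == ":employer")

print(employer_of("alice"))  # acme_corp
```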
I guess my question is: the property-graph model seems able to express all kinds of data neatly, so what is the catch in practice? I suppose the catch is performance, so I'd love to see some numbers or rules of thumb on what kind of slowdowns to expect when loading, querying and modifying data, as well as on memory and persistent storage requirements (compared to document stores and triplestores). And what about horizontal scalability? My impression is that the playing field is fairly level there.
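On the scalability point, this is the kind of effect I would want numbers for (toy Python, random data, naive hash partitioning chosen arbitrarily): documents shard cleanly by key, but once nodes are spread over shards, many edges end up crossing shard boundaries, and presumably each cross-shard hop costs a network round trip.

```python
import random

random.seed(0)
NUM_NODES, NUM_EDGES, NUM_SHARDS = 10_000, 50_000, 4

# Random toy graph: edges between arbitrary node ids.
edges = [(random.randrange(NUM_NODES), random.randrange(NUM_NODES))
         for _ in range(NUM_EDGES)]

# Naive sharding: place each node on a shard by hashing its id.
shard_of = lambda node: hash(node) % NUM_SHARDS

cross_shard = sum(1 for a, b in edges if shard_of(a) != shard_of(b))
print(f"{cross_shard / NUM_EDGES:.0%} of edges cross shard boundaries")
# With 4 shards and random placement, roughly 75% of edges are cross-shard,
# so most traversals would have to leave the local shard.
```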
Do you think it is possible that graphs, with their expressiveness, will become the new default storage model for projects that don't have huge amounts of data, or are we doomed to a decade of polyglot persistence, with RDBMSs, JSON stores and graph DBs living alongside each other and being integrated with ever more glue code?