Why aren't triple stores implemented as native graph stores, the way property graph stores are?

Question

SPARQL-based stores, or put another way, triple stores, are known to be less efficient than property graph stores, and on top of that they reportedly cannot be distributed while maintaining performance the way property graph stores can.

I understand that there are a lot of things at stake here, such as inferencing and so on. Putting distribution and inferencing aside, and limiting ourselves to RDFS, which can be fully captured via SPARQL, I am wondering why that is.
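To make the "RDFS captured at query time" idea concrete, here is a minimal sketch covering only the subclass part of RDFS. It mimics what a SPARQL 1.1 property path such as `?x rdf:type/rdfs:subClassOf* :Animal` would compute. The triples and the helper names (`superclasses`, `instances_of`) are made up for illustration and do not come from any particular store.

```python
# Toy data: (subject, predicate, object). All identifiers are hypothetical.
TRIPLES = {
    ("Dog", "rdfs:subClassOf", "Mammal"),
    ("Mammal", "rdfs:subClassOf", "Animal"),
    ("rex", "rdf:type", "Dog"),
    ("tom", "rdf:type", "Mammal"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def instances_of(cls, triples):
    """Instances of cls under RDFS subclass semantics, computed at query time."""
    return {
        s
        for s, p, o in triples
        if p == "rdf:type" and cls in superclasses(o, triples)
    }


print(sorted(instances_of("Animal", TRIPLES)))
```

Nothing is materialised here: the subclass closure is followed while the query runs, which is the sense in which this part of RDFS can be pushed into the query itself. Other RDFS rules (domain, range, subPropertyOf) would need similar treatment and are left out of this sketch.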

More specifically, why is the storage the issue? What prevents SPARQL-based stores from storing data the way property graph stores do, and from performing traversals instead of massive join queries? Can't SPARQL simply be translated to Gremlin steps, for instance? What is the limitation there? Can't the joins be avoided?

My assumption is that if SPARQL can be translated into efficient traversal steps, and data is stored the way property graph stores do it, as JanusGraph does for example (https://docs.janusgraph.org/latest/data-model.html), then the performance gap would be bridged while maintaining some inference such as RDFS.
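The contrast between the two evaluation styles can be sketched on a toy scale. The example below answers the same two-pattern query (`?a knows ?b . ?b worksAt ?c`) once as a hash join over matched triple sets and once as a walk over an adjacency index. The data and function names are invented for illustration, and this is not how any specific engine is implemented. Many triple stores keep several index orderings, so the practical gap may be smaller than this toy suggests.

```python
from collections import defaultdict

# Toy data set; all names are hypothetical.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "initech"),
    ("dave", "worksAt", "acme"),
]


def join_plan(triples):
    """Set-oriented: match each pattern over all triples, then hash-join on ?b."""
    knows = [(s, o) for s, p, o in triples if p == "knows"]
    works = [(s, o) for s, p, o in triples if p == "worksAt"]
    by_b = defaultdict(list)
    for b, c in works:
        by_b[b].append(c)
    return sorted((a, b, c) for a, b in knows for c in by_b.get(b, []))


def build_adjacency(triples):
    """Index-free-adjacency style layout: vertex -> edge label -> neighbours."""
    out = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        out[s][p].append(o)
    return out


def traversal_plan(adj):
    """Traversal-oriented: start at each vertex and follow edges hop by hop."""
    results = []
    for a in list(adj):
        for b in adj[a].get("knows", []):
            for c in adj.get(b, {}).get("worksAt", []):
                results.append((a, b, c))
    return sorted(results)


print(join_plan(TRIPLES))
print(traversal_plan(build_adjacency(TRIPLES)))
```

Both plans return the same bindings. The difference is in what gets touched: the join plan scans every `worksAt` triple (including `dave`, who is irrelevant), while the traversal only visits neighbours of vertices it has already reached.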

That being said, SPARQL is not Turing-complete of course, but at least for what it does, it would do it fast and possibly at scale as well. The goal in my view is not to compete, but to benefit from SPARQL's ease of use while using a traversal language like Gremlin for the things that really require it, e.g. OLAP.

Is there any project in that direction? Has Apache Jena considered any of this?

I saw that Grakn's Graql seems to be taking that road for the reasons I explain above, so what is stopping the triple store community?

I don't get your question. When is a triple store slower than which other property graph based store? — UninformedUser, Dec 25 '18 at 13:33
Why should a triple store use a property graph model under the hood? A triple store is for RDF data - RDF does not follow a property graph model. — UninformedUser, Dec 25 '18 at 13:34
I think this question is rather complex to understand. It has to do with implementation and the formal semantics of the query language. I'm still on the journey and that is why I ask the question, but maybe the following paper would put you on the path: https://arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals), i.e. Gremlinator — MaatDeamon, Dec 25 '18 at 13:38
This has to do with join vs. micro-index implementations. The SPARQL algebra is based on the join operator: http://www.inf.unibz.it/~nutt/Teaching/SemTechs1415/SemTechsSlides/6-SPARQL-Semantics.pdf — MaatDeamon, Dec 25 '18 at 13:40
Hope it helps you investigate yourself, and if you have an answer please bring it back :) — MaatDeamon, Dec 25 '18 at 13:40
Note that everything is graph data. In the end it is all a graph data structure. You can certainly express your RDF as a property graph and vice versa. My concern is more with the implementation of backends, although there is a link with the query language formalism, which to some extent dictates how you operate on the data. A graph traversal operates on a graph structure, hence your backend can be based on a graph structure, while the SPARQL algebra deals with sets, joins and so on, which in my understanding seems to require other data structures to operate on. — MaatDeamon, Dec 25 '18 at 13:49
There is a lot to unpack there :) The journey is long, I was hoping to accelerate it with Stack Overflow — MaatDeamon, Dec 25 '18 at 13:52
Note that "Gremlinator" is now sparql-gremlin at Apache TinkerPop - http://tinkerpop.apache.org/docs/3.4.0-SNAPSHOT/reference/#sparql-gremlin — stephen mallette, Dec 31 '18 at 12:50
Thank you. That’s actually good news that it is becoming more “official”. If they finish the whole thing, that will be a big boost for the semantic stack and for implementing knowledge graphs with semantic tech. More scalability. — MaatDeamon, Dec 31 '18 at 12:59

MaatDeamon · Answer 1 · 2018-12-26T19:47:58.503

@Michael, I am happy that you stepped in, as you definitely know more than me on this :). I am on a learning journey at this point. At your request, here is one of the papers that inspired my understanding:

arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals)

I quote them:

"We present a comprehensive empirical evaluation of Gremlinator and demonstrate its validity and applicability by executing SPARQL queries on top of the leading graph stores Neo4J, Sparksee and Apache TinkerGraph and compare the performance with the RDF stores Virtuoso, 4Store and JenaTDB. Our evaluation demonstrates the substantial performance gain obtained by the Gremlin counterparts of the SPARQL queries, especially for star-shaped and complex queries."

They explain, however, that the results depend to some extent on the type of query.

Another Stack Overflow answer, "Comparison of Relational Databases and Graph Databases", would also help in understanding the difference between set-based and path-based approaches. My understanding is that triple stores work with sets too. That being said, I am definitely not aware of all the optimization techniques implemented in triple stores lately, and I have seen several papers explaining techniques that significantly prune set join operations.

On distribution, it is more of a gut feeling. For instance, doing join operations in a distributed fashion sounds very expensive to me. I don't have the papers at hand and my research on the matter is not exhaustive. But from what I have read (and I will have to dig through my Evernote :) to back it up), that is the fundamental problem with distribution. Automated smart sharding does not seem to alleviate the issue.

@Michael, this is a very complex subject. I'm definitely still on the journey and that's why I am using Stack Overflow to guide my research. You probably have an idea as to why, so feel free to provide pointers.

That being said, I am not saying that there is a problem with RDF or that property graphs are better. I am saying that when it comes to graph traversal, there are ways of implementing a backend that make it fast. The data model is not the issue here; the data structure used to support the traversal is. The second thing I am saying is that the choice of query language seems to influence how the "traversal" is performed, and hence the data structure used to back the data model.

That's my understanding so far, and yes, I do understand that there are a lot of other factors at play. Feel free to enumerate some of them to guide my journey.

In short, my question comes down to this: is it possible to have RDF stores backed by so-called native graph storage, and then implement SPARQL in terms of traversal steps rather than joins over sets as per its algebra? Wouldn't that make things a bit faster? It seems to me that this is somewhat the approach taken by https://github.com/graknlabs/grakn, which is primarily backed by JanusGraph for graph-like storage. Although it is not RDF, Graql is the same idea as having RDFS++ plus SPARQL. They claim to simply do it better, about which I have my reservations, but that's not the fundamental question of this thread.

The bottom line is that they back knowledge representation with the information retrieval style (path traversal) and the accompanying storage approach that property graphs championed. Let me be clear on this: I am not saying that native graph storage belongs exclusively to property graphs. In my mind it is just a storage approach optimized for graph structures where information retrieval involves (path) traversal: https://docs.janusgraph.org/latest/data-model.html.

score 0 · Answer 2 · answered Dec 26 '18 at 13:09

First, I'd love to see the references that back up your claim that RDF-based systems are inherently less efficient than property graph ones, because frankly it's a nonsensical claim. Further, there have been distributed, and I'm assuming you mean scale-out, RDF stores, so the claim that they are not able to be distributed is simply incorrect.

The Property Graph model, and Gremlin, can easily be implemented on top of an RDF-based system. This has been done at least twice to my knowledge, and in one of those implementations reasoning was supported at the Gremlin/Property Graph layer. So you don't need to be a Property Graph based system to support that model. There are a myriad of reasons why systems, RDF and Property Graph alike, make specific implementation choices, from storage to execution and beyond, and those choices are guided in part by the "native" model, the technology chosen for implementation, and perhaps most importantly, the use cases for the system and the problems it aims to solve.

Further, it's unclear what you recommend the authors of RDF-based systems actually do; are you suggesting scale-out is beneficial? Are you stating that your preference for the Property Graph model should be taken as gospel such that RDF-based systems give up and switch data models? Do you want Property Graph systems to retrofit RDFS?

Finally, to the initial question you asked, I think you have it exactly backwards; the Property Graph model is a hybrid graph model mixing elements of graph and key-value models, whereas the RDF model is a pure, i.e. native, graph model. Gremlin will be adopting the RDF model, albeit with syntactic sugar around what the RDF world calls reification and everyone else calls edge properties. So in a world where your exemplar of the Property Graph model is abandoning said model, I'm not sure what more to tell you, other than that you should do a bit more background research.
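For readers unfamiliar with the term, the correspondence between an edge property and RDF reification can be sketched as follows. The `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms are the standard RDF reification vocabulary; the helper name `reify_edge`, the blank-node label `_:e1` and the sample data are assumptions made up for this illustration.

```python
def reify_edge(edge_id, source, label, target, properties):
    """Expand one property-graph edge into RDF triples via standard reification."""
    triples = [
        # The plain edge itself.
        (source, label, target),
        # A resource standing for "the statement that source -label-> target".
        (edge_id, "rdf:type", "rdf:Statement"),
        (edge_id, "rdf:subject", source),
        (edge_id, "rdf:predicate", label),
        (edge_id, "rdf:object", target),
    ]
    # Each edge property becomes an ordinary triple about the statement resource.
    triples += [(edge_id, key, value) for key, value in sorted(properties.items())]
    return triples


for triple in reify_edge("_:e1", "alice", "knows", "bob", {"since": 2015}):
    print(triple)
```

One property-graph edge with a single property turns into six triples here, which is why syntactic sugar over reification is attractive.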

I replied by posting a full answer. However, thank you for the paper. It actually cites the paper I posted in my response. This is definitely an interesting paper, and I recognise some of my own understanding of things in it. It seems that with TinkerPop 4 we will have a lot of flexibility, including better support for translating any language (SPARQL, Graql) to TinkerPop and using whatever backend behind the scenes. I don't see it as contradicting my claim. However, I need to read more of it. Thank you for the paper, very interesting. — MaatDeamon, Dec 26 '18 at 17:47
