
I am struggling to find documentation and examples for constructing and using Jackrabbit Oak in a clustered environment where node stores are sharded by path. I know this is possible because it is referenced in a few places, but with very little detail, and the Oak and NodeStore APIs are not intuitive enough to discover this functionality on my own.

Take a look at slide 17 in this PDF, which lists the various sharding strategies: http://events.linuxfoundation.org/sites/events/files/slides/the%20architecture%20of%20Oak.pdf

My use case is that I need several remote servers all running the same Jackrabbit Oak application, which uses the DocumentNodeStore backed by MongoDB for node and blob storage. What I ultimately want is to shard (or partition) my data across these remote servers, organized by different paths in the overall node structure.
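For reference, wiring a DocumentNodeStore to MongoDB and exposing it through JCR looks roughly like this. This is a sketch based on the Oak 1.x API; the connection URI, database name, and blob cache size are placeholder values:

```java
import javax.jcr.Repository;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;

public class OakMongoSetup {
    public static void main(String[] args) {
        // Build a DocumentNodeStore backed by MongoDB.
        // (URI, database name "oak", and 16 MB blob cache are placeholders.)
        DocumentNodeStore ns = new DocumentMK.Builder()
                .setMongoDB("mongodb://localhost:27017", "oak", 16)
                .getNodeStore();

        // Expose the node store through the standard JCR API.
        Repository repo = new Jcr(new Oak(ns)).createRepository();

        // ... obtain sessions from repo; dispose the store on shutdown.
        ns.dispose();
    }
}
```

Note that each Oak instance configured this way talks to exactly one MongoDB setup; this wiring alone does not give you the path-based sharding described above.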

For example:

Server (A)
Is responsible for storing content at /a/*

Server (B)
Is responsible for storing content at /b/*

If Server (A) wants to read or write content at /b/*, it should be able to access nodes at that path through the normal JCR or Oak APIs, which should completely abstract away the network details and the connection to Server (B)'s MongoDB.

Is there any solid documentation relating to this use case? If not, what is the best way to go about learning this? I could spend the whole day wandering through the Oak source code, but documentation would be much preferred.

Jon McPherson
  • I don't think this is how clustering in Oak works. Each node in the cluster needs to have access to *all* documents. – Julian Reschke Sep 10 '17 at 04:13
  • @JulianReschke, I think the OP wants to understand how the MongoDB setup backing the repository can be sharded. AFAIU, Mongo sharding would still allow every client to read any document; it's just best if each client reads as few documents as possible from a potentially very remote shard instance. In his example, A would mostly be concerned with /a/* (sure, the root would need to be read too). – catholicon Sep 11 '17 at 12:59

1 Answer


Oak's MongoDB implementation does not currently have a sharding strategy. The problem essentially comes down to the fact that the _id of the Mongo documents stored by Oak would not distribute documents across shards in a way that keeps nodes from the same subtree on the same shard instance. There has been some conversation about adding a shard key to handle this use case, but the discussion didn't move forward much because, so far, we haven't seen a compelling use case that requires shards.

That said, AFAIK you can set up a sharded MongoDB cluster and provide the Mongo URI accordingly. For the reason above, it most likely won't scale as nicely as you might want. Also, we haven't yet seen setups that couldn't be handled by a non-sharded deployment.
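For example, the Mongo URI handed to Oak could point at the mongos query routers of a sharded cluster rather than a single mongod. The hostnames here are placeholders; this is just standard MongoDB connection-string syntax, and the shard-key caveat above still applies:

```
mongodb://mongos1.example.com:27017,mongos2.example.com:27017/oak
```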

I know this doesn't answer your question, but maybe it's an indication of why you couldn't find much written about the topic.

catholicon
  • Thanks for the great explanation! I am thinking that I may be able to get away with clustering multiple standalone MongoDB instances by "sharding" the data myself. Luckily my use case is rather simple, and I am able to partition the dataset easily. I can implement some form of service discovery to find and connect to the various MongoDB instances in the cluster. However, with this setup, I am confused about how I would have a single Oak instance connect to the various MongoDB instances. Would I need to create a new Repository instance for each? Is this the right approach? – Jon McPherson Sep 12 '17 at 04:34
  • I think technically your comment is a different question and probably against SO policies :). Anyway, an Oak cluster would connect to only one MongoDB setup (single instance, replica set, sharded cluster, whatever). The type of partitioning you're describing could only be done by the application sitting above Oak. And yes, you'd need multiple Oak instances, with your app multiplexing between them. About "is this the right approach": I think you might be optimizing prematurely. I'd suggest not going for a complex setup at the beginning. Also, hard-coded partitioning might hurt in the long run. – catholicon Sep 12 '17 at 22:48
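The application-level multiplexing described in the comment above could be sketched as a simple longest-prefix router. Everything below, class and method names included, is hypothetical application code sitting above Oak, not any Oak API; the app would keep one Repository per shard and look up the right one before opening a session:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical router mapping a JCR path prefix to the identifier of
// the Oak/Mongo instance responsible for it.
public class PathShardRouter {
    private final Map<String, String> shardsByPrefix = new LinkedHashMap<>();
    private final String defaultShard;

    public PathShardRouter(String defaultShard) {
        this.defaultShard = defaultShard;
    }

    public void register(String pathPrefix, String shardId) {
        shardsByPrefix.put(pathPrefix, shardId);
    }

    // Longest-prefix match, so /a/b can route differently from /a.
    public String shardFor(String path) {
        String best = defaultShard;
        int bestLen = -1;
        for (Map.Entry<String, String> e : shardsByPrefix.entrySet()) {
            String prefix = e.getKey();
            boolean matches = path.equals(prefix)
                    || path.startsWith(prefix + "/");
            if (matches && prefix.length() > bestLen) {
                best = e.getValue();
                bestLen = prefix.length();
            }
        }
        return best;
    }
}
```

With `/a` registered to Server (A) and `/b` to Server (B), `shardFor("/a/x/y")` returns Server (A)'s identifier, and anything outside both prefixes falls back to the default. As noted above, though, this kind of hard-coded partitioning may hurt in the long run.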