I've been toying with modeling a graph structure (a property graph with named relationships) in CouchDB, and I would like to know what performance bottlenecks I should expect.
I'm using the following principles:
- Keep documents small.
- Try to embed as little as possible.
- Record all relationships between documents as a new document (a link).
It seems that all these principles are in contradiction with the CouchDB philosophy.
With these principles, for example, tagging a person becomes three documents:
{ _id: '10', type: 'person', 'name': 'John Doe' }
{ _id: '20', type: 'tag', 'name': 'Important' }
{ _id: '30', type: 'link', from: '10', to: '20', name: 'tag' }
I have also created the following views in a _design document called links:
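{
  outgoing: {
    map: function(doc) {
      if (doc.type == 'link') {
        // Key: [source id, relationship name]; value points at the target document.
        emit([doc.from, doc.name], {_id: doc.to});
      }
    }
  },
  incoming: {
    map: function(doc) {
      if (doc.type == 'link') {
        // Key: [target id, relationship name]; value points at the source document.
        emit([doc.to, doc.name], { _id: doc.from });
      }
    }
  }
}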
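To sanity-check which rows these views produce, the same map logic can be run outside CouchDB against the three sample documents. This is only a sketch: `runMap` and the local `emit` stand-in are hypothetical helpers, not CouchDB APIs, and it assumes the link stores `from` and `to` as strings so that they match the `_id` values.

```javascript
// Sample documents; the link's from/to are strings so they match the _id values.
const docs = [
  { _id: '10', type: 'person', name: 'John Doe' },
  { _id: '20', type: 'tag', name: 'Important' },
  { _id: '30', type: 'link', from: '10', to: '20', name: 'tag' }
];

// Same logic as the views, wrapped so that emit can be supplied locally.
const outgoing = emit => function (doc) {
  if (doc.type == 'link') {
    emit([doc.from, doc.name], { _id: doc.to });
  }
};
const incoming = emit => function (doc) {
  if (doc.type == 'link') {
    emit([doc.to, doc.name], { _id: doc.from });
  }
};

// Hypothetical helper: collects the rows a map function would emit.
function runMap(makeMap, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    makeMap(emit)(doc);
  }
  return rows;
}

const outgoingRows = runMap(outgoing, docs);
const incomingRows = runMap(incoming, docs);
console.log(JSON.stringify(outgoingRows));
// [{"id":"30","key":["10","tag"],"value":{"_id":"20"}}]
```

Only the link document emits anything, so each view holds exactly one row per link.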
I can get all the links incoming or outgoing from a document with these urls:
http://host/db/_design/links/_view/incoming?startkey=["10"]&endkey=["10",{}]
http://host/db/_design/links/_view/outgoing?startkey=["10"]&endkey=["10",{}]
I can even get all the links by name with these urls:
http://host/db/_design/links/_view/incoming?startkey=["10","tag"]&endkey=["10","tag",{}]
http://host/db/_design/links/_view/outgoing?startkey=["10","tag"]&endkey=["10","tag",{}]
And if I include the include_docs=true parameter, I get the documents referenced by the link, either incoming or outgoing. So far so good. There is a graph structure and a way to query it, albeit on a node-by-node basis.
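Because the keys are JSON arrays, they need to be URL-encoded when these requests are built in code. A minimal sketch of how the URLs above could be assembled; `linksUrl` is a hypothetical helper, and `http://host/db` is just the placeholder base URL used above:

```javascript
// Hypothetical helper: builds a query URL for a view in the "links" design document.
// view is 'incoming' or 'outgoing'; name is optional and narrows the range
// to a single relationship name.
function linksUrl(base, view, nodeId, name, includeDocs) {
  const prefix = name === undefined ? [nodeId] : [nodeId, name];
  const params = [
    'startkey=' + encodeURIComponent(JSON.stringify(prefix)),
    // A trailing {} sorts after any string, closing the key range.
    'endkey=' + encodeURIComponent(JSON.stringify(prefix.concat([{}])))
  ];
  if (includeDocs) {
    params.push('include_docs=true');
  }
  return base + '/_design/links/_view/' + view + '?' + params.join('&');
}

const url = linksUrl('http://host/db', 'outgoing', '10', 'tag');
console.log(decodeURIComponent(url));
// http://host/db/_design/links/_view/outgoing?startkey=["10","tag"]&endkey=["10","tag",{}]
```

Passing `true` as the last argument appends `include_docs=true`, which fetches the linked documents in the same request.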
Good things about this approach:
- It is a general way of storing all relationships. Not necessarily tags, but every relationship.
- You can change the tag name quickly, without changing every person tagged.
- You can merge persons or tags and just update the link documents, which should be very simple.
- Tagging when using replication does not change the documents being tagged or the tags themselves. Just add or delete a tiny link document.
- It would be easy to keep a history of tags for each element.
Bad things, and where I need your help:
- Querying for a list of people with their tags is not trivial. In general, querying for a list of documents and their relationships is a very expensive operation that requires many requests.
- Updating the database and keeping it consistent could be a problem. Maybe this is something that will never go away when using CouchDB.
- Doing 'maintenance' on the database, like finding orphan links, could be expensive. Perhaps the database requires garbage collection?
- Visualizing and manipulating this graph structure is neither intuitive nor simple, and applications developed on top of it are responsible for all the graph structure management (which is a bit scary!).
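To make the maintenance concern concrete, here is a rough sketch of what an orphan-link check might look like on the client side. It assumes every document has already been loaded into memory (which is exactly the full scan I am worried about), and `findOrphanLinks` is a hypothetical helper, not something CouchDB provides:

```javascript
// Hypothetical helper: returns link documents whose from or to id
// does not match any document in the given array.
function findOrphanLinks(allDocs) {
  const ids = new Set(allDocs.map(doc => doc._id));
  return allDocs.filter(doc =>
    doc.type === 'link' && !(ids.has(doc.from) && ids.has(doc.to)));
}

// Made-up sample data: link '31' points at a tag that no longer exists.
const sample = [
  { _id: '10', type: 'person', name: 'John Doe' },
  { _id: '20', type: 'tag', name: 'Important' },
  { _id: '30', type: 'link', from: '10', to: '20', name: 'tag' },
  { _id: '31', type: 'link', from: '10', to: '99', name: 'tag' }
];

const orphans = findOrphanLinks(sample);
console.log(orphans.map(doc => doc._id)); // [ '31' ]
```

The check itself is simple, but it touches every document in the database, so its cost grows with the total number of documents rather than with the number of orphans.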
So back to my questions:
- What are the potential bottlenecks to expect?
- Will this approach scale to millions of records?
- How can I traverse this structure efficiently without making many requests to the server?