
I've been toying with modeling a graph structure (a property graph with named relationships) in CouchDB and would like to know what potential performance bottlenecks I should expect.

I'm using the following principles:

  • Keep documents small.
  • Try to embed as little as possible.
  • Record all relationships between documents as a new document (a link).

It seems that all these principles contradict the CouchDB philosophy.

With these principles, for example, tagging a person becomes three documents:

{ _id: '10', type: 'person', 'name': 'John Doe' }
{ _id: '20', type: 'tag', 'name': 'Important' }
{ _id: '30', type: 'link', from: '10', to: '20', name: 'tag' }

I have also created the following views in a _design document called links:

{
  outgoing: {
    map: function(doc) {
      if (doc.type == 'link') {
        emit([doc.from, doc.name], {_id: doc.to});
      }
    }
  },
  incoming: {
    map: function(doc) {
      if (doc.type == 'link') {
        emit([doc.to, doc.name], { _id: doc.from });
      }
    }
  }
}
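To make the emitted rows concrete, here is a minimal local simulation of the two map functions. It is only a sketch: `runView` is a hypothetical helper, `emit` is passed in as an argument instead of being the global that CouchDB provides, the rows are not sorted the way a real view index would be, and it assumes `from` and `to` hold the string ids of the linked documents.

```javascript
// Hypothetical helper: runs a map function over an array of documents and
// collects the emitted rows. Not part of CouchDB; rows are left unsorted.
function runView(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    // CouchDB exposes emit() as a global; here it is passed in explicitly.
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Same logic as the `outgoing` and `incoming` views above.
const outgoingMap = (doc, emit) => {
  if (doc.type == 'link') emit([doc.from, doc.name], { _id: doc.to });
};
const incomingMap = (doc, emit) => {
  if (doc.type == 'link') emit([doc.to, doc.name], { _id: doc.from });
};

// Sample documents; ids in `from` and `to` are assumed to be strings.
const sampleDocs = [
  { _id: '10', type: 'person', name: 'John Doe' },
  { _id: '20', type: 'tag', name: 'Important' },
  { _id: '30', type: 'link', from: '10', to: '20', name: 'tag' },
];

const outgoingRows = runView(outgoingMap, sampleDocs);
const incomingRows = runView(incomingMap, sampleDocs);
console.log(JSON.stringify(outgoingRows));
console.log(JSON.stringify(incomingRows));
```

Only the link document produces a row in each view; the person and the tag emit nothing.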

I can get all the links incoming or outgoing from a document with these urls:

http://host/db/_design/links/_view/incoming?startkey=["10"]&endkey=["10",{}]
http://host/db/_design/links/_view/outgoing?startkey=["10"]&endkey=["10",{}]

I can even get all the links by name with these urls:

http://host/db/_design/links/_view/incoming?startkey=["10","tag"]&endkey=["10","tag",{}]
http://host/db/_design/links/_view/outgoing?startkey=["10","tag"]&endkey=["10","tag",{}]
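The key parameters are JSON values, so in practice they need to be URL-encoded. A small sketch of how these URLs could be built follows; `viewUrl` and its parameters are hypothetical names of my own, and the base URL is the placeholder used above.

```javascript
// Hypothetical helper: builds a URL for the `incoming` or `outgoing` view
// with JSON-encoded, URL-escaped startkey/endkey parameters.
function viewUrl(base, view, nodeId, linkName, includeDocs = false) {
  // Key prefix: just the node id, or node id plus link name.
  const prefix = linkName === undefined ? [nodeId] : [nodeId, linkName];
  const params = new URLSearchParams({
    startkey: JSON.stringify(prefix),
    // Appending {} gives an upper bound for every key that starts with the prefix.
    endkey: JSON.stringify([...prefix, {}]),
  });
  if (includeDocs) params.set('include_docs', 'true');
  return `${base}/_design/links/_view/${view}?${params}`;
}

// All outgoing links of document "10".
const allOutgoingUrl = viewUrl('http://host/db', 'outgoing', '10');
// Only the incoming "tag" links of document "20", with the linked documents.
const tagIncomingUrl = viewUrl('http://host/db', 'incoming', '20', 'tag', true);
console.log(allOutgoingUrl);
console.log(tagIncomingUrl);
```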

And if I add the include_docs=true parameter, I get the documents referenced by the link, either incoming or outgoing. So far so good: there is a graph structure and a way to query it, albeit on a node-by-node basis.

Good things about this approach:

  • It is a general way of storing all relationships. Not necessarily tags, but every relationship.
  • You can change the tag name quickly, without changing every person tagged.
  • You can merge persons or tags and just update the link documents, which should be very simple.
  • Tagging when using replication does not change the documents being tagged or the tags themselves. Just add or delete a tiny link document.
  • It would be easy to keep a history of tags for each element.

Bad things, and where I need your help:

  • Querying for a list of people with their tags is not trivial. In general, querying for a list of documents and their relationships is a very expensive operation that requires many hits.
  • Updating the database and keeping it consistent could be a problem. Maybe this is something that will never go away when using couch.
  • Doing 'maintenance' on the database, like finding orphan links, could be expensive. Perhaps the database requires garbage collection?
  • Visualizing and manipulating this graph structure is neither intuitive nor simple, and applications developed on top of it are responsible for all the graph structure management (which is a bit scary!).
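To illustrate the maintenance point, this is roughly what an orphan-link check would have to do. It is a sketch under a strong assumption: every document has already been loaded into memory as an array, which is exactly the part that could get expensive on a large database. `findOrphanLinks` is a hypothetical name.

```javascript
// Hypothetical helper: returns the link documents whose `from` or `to`
// does not match the _id of any document in the given array.
function findOrphanLinks(docs) {
  const knownIds = new Set(docs.map((doc) => doc._id));
  return docs.filter(
    (doc) =>
      doc.type === 'link' &&
      (!knownIds.has(doc.from) || !knownIds.has(doc.to))
  );
}

// Sample data: link '31' points at a tag that no longer exists.
const maintenanceDocs = [
  { _id: '10', type: 'person', name: 'John Doe' },
  { _id: '20', type: 'tag', name: 'Important' },
  { _id: '30', type: 'link', from: '10', to: '20', name: 'tag' },
  { _id: '31', type: 'link', from: '10', to: '99', name: 'tag' },
];

const orphanLinks = findOrphanLinks(maintenanceDocs);
console.log(orphanLinks.map((doc) => doc._id));
```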

So back to my questions:

  1. What are the potential bottlenecks to expect?
  2. Will this approach scale to millions of records?
  3. How can I traverse this structure efficiently without making many server hits?
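To make the third question concrete, here is a sketch of a level-by-level traversal that counts round trips. Everything in it is an assumption for illustration: `fetchOutgoing` is an in-memory stand-in for whatever single request would return the outgoing links of a set of nodes (how that request is made against CouchDB is left open), and `traverse` is a hypothetical name.

```javascript
// In-memory stand-in for one request that returns the outgoing links
// of every node in `nodeIds`. Hypothetical; no server is contacted.
function fetchOutgoing(links, nodeIds) {
  const wanted = new Set(nodeIds);
  return links.filter((link) => wanted.has(link.from));
}

// Breadth-first traversal that issues one batched lookup per depth level
// and reports how many lookups ("round trips") were needed.
function traverse(links, startId, maxDepth) {
  const visited = new Set([startId]);
  let frontier = [startId];
  let roundTrips = 0;
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const rows = fetchOutgoing(links, frontier);
    roundTrips++;
    const next = [];
    for (const row of rows) {
      if (!visited.has(row.to)) {
        visited.add(row.to);
        next.push(row.to);
      }
    }
    frontier = next;
  }
  return { visited: [...visited], roundTrips };
}

// Sample graph with a cycle: a -> b, a -> c, b -> d, d -> a.
const sampleLinks = [
  { from: 'a', to: 'b', name: 'knows' },
  { from: 'a', to: 'c', name: 'knows' },
  { from: 'b', to: 'd', name: 'knows' },
  { from: 'd', to: 'a', name: 'knows' },
];

const twoLevels = traverse(sampleLinks, 'a', 2);
console.log(twoLevels.visited, twoLevels.roundTrips);
```

Even when each level is batched into one lookup, the number of round trips still grows with the depth of the traversal, which is the cost I am worried about.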
Jonathan Hall
Ricardo Marimon
  • So to navigate the graph you would either need to pull back the whole topology or make multiple queries as you expand the "depth" or change the direction of how you navigate, e.g. starting in one context and then moving to another linked context. Why not use a Graph DB that has been designed with having links as first class citizens? Toy project or...? – Daniel Sep 21 '14 at 18:47
  • @Daniel, it all started with Neo4J where links exist. But when trying to move to a multi master, mobile first world, the couchdb replication framework is part of what I need badly. – Ricardo Marimon Sep 22 '14 at 19:34
  • So a data model is discarded for a technology "failure"? If you continue the above path I'm more than interested in a follow up blog post or something about the outcome. Have you looked at alternatives to Neo? http://www.orientechnologies.com/orientdb-vs-neo4j/ – Daniel Sep 23 '14 at 14:57
  • Out of curiosity: what does mobile first have to do with CouchDB's replication? Are you referring to e.g. Pouch and client side storage with replication to a CouchDB node? – Daniel Sep 23 '14 at 15:01
  • @Daniel, mobile first means that the application we are experimenting with will mostly be used on mobile devices, and in Latin America, where data coverage is "iffy" to say the least. All these applications will have Pouch on the local side and replicate to a centralized Couch. So the replication part of couch/pouch is extremely important. – Ricardo Marimon Sep 25 '14 at 13:30
  • Pretty much what my guess was then. So the linked construct will be server-side then I guess. And only materialized data would be going back and forth. Can't that coexist with a graphdb? Or potentially you could write a custom implementation of the Couch replication protocol. Anyway, would be interesting to hear how this turns out. – Daniel Sep 25 '14 at 16:04
  • Based on the way the problem is presented, you've tried to map a graph onto a document database using a relational design. ("link" is really just a many-to-many association with the third entity, linkType, embedded in the type name "tag".) It would help to see the graph use case. Are you going to start with a person and then find similarly tagged people, and then use their tags for related tags, etc.? – NaturalData Dec 06 '14 at 15:55
