1

It has taken me quite a long (calendar) time to get my head around CouchDB and map/reduce and how I can utilize it for various use cases. One challenge I've put myself to understanding is how to use it for normalized data effectively. Sources all over the internet simply stop with "don't use it for normalized data.". I do not like the lack of analysis on how to use it effectively with normalized data!

Some of the better resources I've found are below:

CouchDB: Single document vs "joining" documents together http://www.cmlenz.net/archives/2007/10/couchdb-joins

In both cases, the authors do a great job at explaining how to do a "join" when it is necessary to join documents when there is denormalized commonality across them. If, however, I need to join more than two normalized "tables" the view collation tricks leveraged to query just one row of data together do not work. That is, it seems you need some sort of data about all elements in the join to exist in all documents that would participate in the join, and thus, your data is not normalized!

Consider the following simple Q&A example (question/answer/answer comment):

{ id: "Q1", type: "question", question: "How do I...?" }
{ id: "A1", type: "answer", answer: "Simple... You just..." }
{ id: "C1", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "C2", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "QA1", type: "question-answer-relationship", q_id:"Q1", a_id:"A1" }
{ id: "AC1", type: "answer-comment-relationship", a_id:"A1", c_id:"C1" }
{ id: "AC2", type: "answer-comment-relationship", a_id:"A1", c_id:"C2" }
{ id: "Q2", type: "question", question: "What is the fastest...?" }
{ id: "A2", type: "answer", answer: "Do it this way..." }
{ id: "C3", type: "answer-comment", comment: "Works great! Thanks!" }
{ id: "QA2", type: "question-answer-relationship", q_id:"Q2", a_id:"A2" }
{ id: "AC3", type: "answer-comment-relationship", a_id:"A2", c_id:"C3" }

I want to get one question, its answer, and all of its answer's comments, and no other records from the databse with only one query.

With the data set above, at a high level, you'd need to have views for each record type, ask for a particular question with an id in mind, then in another view, use the question id to look up relationships specified by the question-answer-relationship type, then in another view look up the answer by the id obtained by the question-answer-relationship type, and so on and so forth, aggregating the "row" over a series of requests.

Another option might be to create some sort of application that does process above to cache denormalized documents in the desired format that automatically react to the normalized data being updated. This feels awkward and like a reimplementation of something that already exists/should exist.

After all of this background, the ultimate question is: Is there a better way to do this so the database, rather than the application, does the work?

Thanks in advance for anyone sharing their experience!

Community
  • 1
  • 1
Buddy
  • 160
  • 1
  • 8

1 Answers1

2

The document model you have is what I would do if I'm using traditional relational database, since you can perform joins more naturally with those ids.

For a document database however, this will introduce complexity since 'joining' document with MapReduce isn't the same thing.

In the Q&A scenario you presented, I would model it as follow:

{
    id: "Q1",
    type: "question",
    question: "How do I...?"
    answers: [
        {
            answer: "Simple... You just...",
            comments: [
                { comment: "Great... But what about...?" },
                { comment: "Great... But what about...?" }
            ]
        },
        {
            answer: "Do it this way...",
            comments: [
                { comment "Works great! Thanks!" },
                { comment "Nope, it doen't work" }
            ]
        }
    ]
}

This can solve a-lot of issues with read from the db, but it does make your write more complex, for example when adding a new comment to an answer, you will need to

  1. Get the document out from CouchDB.
  2. Loop through the answer and find the correct position, and push comment into the array.
  3. Save document back to CouchDB.

I'd only consider to spit the answer as a separate document if there's a-lot of them (e.g. 1 question yield 1000 answers'), otherwise it's easier to just package them in a single document. But even in that case, try putting the relationship info inside the document, e.g.

{
    id: "Q1",
    type: "question",
    question: "How do I...?"
}
{
    id: "A1",
    type: "answer",
    answer: "Simple... You just..."
    question_id: "Q1"
}
{
    id: "C1",
    type: "comment",
    comment: "Works great! Thanks!"
    answer_id: "A1"
}

This can make you'r write operation easier but you will need to create view to join the documents so it returns all documents with one request.

And always keep in mind that the return result from a view is not necessary a flat structure like rows like in sql query.

Andy
  • 5,287
  • 2
  • 41
  • 45
  • Thank you for your valuable time in creating such a well-thought response! However, your suggestion here goes completely against the constraint I'd like to keep of keeping the data normalized! – Buddy Aug 20 '14 at 05:43
  • In that case, you should go for RDBMS database that are made for normalized data. – Andy Aug 21 '14 at 10:46
  • I think what I'm realizing (that I've already read all over the place) is that the distributed qualities of a database like Couch requires that data be denormalized because you really get the benefits of the distributedness when each node has everything it needs to do its work. The trouble I have with that is managing what gets denormalized just feels... chaotic. And of course, it's not a piece of cake to keep proper tabs on things that are *supposed* to be the same that end up bot being that way. – Buddy Aug 23 '14 at 21:15
  • @Buddy, Please take into consideration that CouchDB does NOT distribute `Views` across nodes, only it's data is (via data replication). And CouchDB does NOT dispatch computations to nodes, each nodes / server works independently. – Andy Aug 24 '14 at 07:51