Basic set-based operations using a document database (noSQL)

Question

As with most, I come from an RDBMS world and am trying to get my head around NoSQL databases, specifically document stores (as I find them the most interesting).

I am trying to understand how to perform some set-based operations using a document database (I'm playing with RavenDB).

So as per my understanding:

Union (as in SQL UNION) is a very straightforward append. Additionally, unions between different sets (SQL JOIN) can be achieved with map/reduce. The example given in the RavenDB mythology book with Comment counts on Blog entries is a good start.
Intersection can be performed using a number of techniques, from de-normalization right through to creating a “mapping” or “link” document as described here (and the aggregator example below). In an RDBMS this would be performed using a simple "INNER JOIN" or "WHERE x IN".
Subtract (Relative Complement) is where I am getting stuck. In an RDBMS this operation is simply a "WHERE x NOT IN" or a "LEFT JOIN" where the joined set is NULL.
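
To pin down what the three operations mean, here is a minimal in-memory sketch in Python. It only illustrates the semantics on sets of ids; it is not RavenDB code, and the ids are made up for the example.

```python
# Hypothetical entry ids, used only to illustrate the three operations.
all_entries = {"e1", "e2", "e3", "e4"}
tagged_entries = {"e2", "e4", "e9"}

# Union: everything that appears in either set (SQL UNION).
union = all_entries | tagged_entries

# Intersection: ids present in both sets (INNER JOIN / WHERE x IN).
intersection = all_entries & tagged_entries

# Subtract (relative complement): ids in the first set but not the second
# (WHERE x NOT IN / LEFT JOIN ... WHERE joined IS NULL).
difference = all_entries - tagged_entries

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```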

Using a real-world example, let’s say we have an RSS aggregator (such as Google Reader) with millions if not billions of RSS entries and thousands of users, each tagging entries as favourite, etc.

In this example we focus on entry, user and tag; where tag acts as a link between user and entry.

user {string id, string name /*etc.*/}
entry {string id, string title, string url /*etc.*/}
tag {string userId, string entryId, string[] tags} /* (favourite, read, etc.)*/

With the above approach it is easy to perform the intersection between entry and user using tag. But I cannot get my head around how one would perform a subtract. For instance “Return all items that do not have any tags” or even more daunting “return the latest 1000 items without any tag”.
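
For concreteness, the naive version of the subtract done in application code might look like the sketch below. It assumes the entry and tag shapes given above, plus a hypothetical `published` field on entry for ordering (it is not part of the schema as listed). It has to load every tag document to build the exclusion set, which is exactly the cost that makes it unattractive at this scale.

```python
import heapq

def latest_untagged(entries, tags, limit=1000):
    """Return up to `limit` newest entries that have no tags at all."""
    # Ids of every entry carrying at least one tag, across all users.
    tagged_ids = {t["entryId"] for t in tags if t["tags"]}
    # Relative complement: entries whose id is NOT IN tagged_ids.
    untagged = (e for e in entries if e["id"] not in tagged_ids)
    # `published` is an assumed field, used only for "latest" ordering.
    return heapq.nlargest(limit, untagged, key=lambda e: e["published"])

# Made-up sample data for illustration.
entries = [
    {"id": "e1", "title": "a", "url": "http://example.com/a", "published": 1},
    {"id": "e2", "title": "b", "url": "http://example.com/b", "published": 2},
    {"id": "e3", "title": "c", "url": "http://example.com/c", "published": 3},
    {"id": "e4", "title": "d", "url": "http://example.com/d", "published": 4},
]
tags = [
    {"userId": "u1", "entryId": "e2", "tags": ["read"]},
    {"userId": "u1", "entryId": "e4", "tags": []},  # tag doc with no tags
]

result = latest_untagged(entries, tags, limit=2)
print([e["id"] for e in result])
```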

So my question:

Can you point me to some reading material on the matter?
Can you share some ideas on how one can accomplish the task efficiently?

Note: I know that you lose query flexibility with document databases, but surely there must be a way to do this?

score 2 · Accepted Answer · answered Jul 19 '11 at 10:16

Amok, what you want cannot really be done easily in non-relational databases, mostly because they don't think in sets and have strong ties to distributed computing. Efficient set operations need access to all of the data, and since NoSQL databases are usually used in distributed scenarios, they can't really support that. RavenDB, specifically, allows some operations on a specified set, but it is built strongly on the assumption of independent documents: documents that don't have strong relations to other documents and don't need to be manipulated all together in the same fashion.
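
As an illustration of what the independent-document assumption implies (this is not something the answer itself proposes), the de-normalization route mentioned in the question would store enough on each entry that "has no tags" can be decided from that document alone. The sketch below is plain Python, not RavenDB API; the `tagCount` field and both helper functions are hypothetical.

```python
def add_tag(entries_by_id, tags, user_id, entry_id, labels):
    """Record a tag document and keep the entry's own counter in step."""
    tags.append({"userId": user_id, "entryId": entry_id, "tags": list(labels)})
    # Denormalized counter (assumed field): updated on every write so that
    # reads never need to consult the tag documents.
    entries_by_id[entry_id]["tagCount"] += 1

def untagged(entries_by_id):
    """The subtract becomes a filter on a single field of each document."""
    return [e for e in entries_by_id.values() if e["tagCount"] == 0]

# Made-up sample data for illustration.
entries_by_id = {
    "e1": {"id": "e1", "title": "a", "tagCount": 0},
    "e2": {"id": "e2", "title": "b", "tagCount": 0},
}
tags = []
add_tag(entries_by_id, tags, "u1", "e2", ["favourite"])

print([e["id"] for e in untagged(entries_by_id)])
```

The trade-off is extra work and consistency care on every write in exchange for a query that touches only one kind of document.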

How correct is this comment given the flurry of changes in ravenDB? — Andrew Harry, Apr 13 '12 at 02:10

score 0 · Answer 2 · answered Jul 16 '11 at 22:58

The transition from an RDBMS to a document database isn't completely smooth, and some refactoring of your model may be necessary to make it optimal. This is due to the different natures of those technologies.

Re. set-based operations in RavenDB, see:

http://ayende.com/blog/4535/set-based-operations-with-ravendb

http://ravendb.net/documentation/set-based

answered Jul 16 '11 at 22:58

synhershko


Unfortunately your response does not quite answer the question. The links you provided refer to very simple operations. What I am looking for is a technique to handle a "NOT IN" operation (subtract or relative complement) between two sets using a document store. The only apparent solution at this stage appears to be through custom application code, which will not be entirely efficient. – amok Jul 18 '11 at 12:01
