Which NoSQL database do you recommend and how would the schema look for the following web application requirements.

  1. There can be a lot of users (500k+)

  2. Every user can enter his/her documents

  3. Every user will probably create 10-200 documents per month

  4. Every document will be small (around 100 words)

  5. User can tag documents with his/her own tags

  6. Data from one user does NOT interact with other users or their data

  7. User can search his entries by tags

  8. Fast access to all entries from one user

  9. User can create complex dynamic queries to query his/her data

My idea is to use MongoDB. But the problem that I see is that there would be just two collections: users and entries.

Searching by tags through one gigantic collection looks like a bad idea to me. I am afraid that the size of indexes will be really large, since every user can have his own tags. MongoDB will create tag indexes for the whole collection, but I will always search by tags only through entries from one user and not from all.

Thus a collection per user idea seems more suitable, but there seems to be a limit on how many collections one can create, also this approach appears to be undesired.

CouchDB does not support dynamic queries,...

How should I implement this in MongoDB? Or name a more appropriate NoSQL database.

Examples of similar applications: rememberthemilk, Trello, ...

asked by Ben

1 Answer

Which NoSQL database do you recommend and how would the schema look for the following web application requirements.

I am not going to define your application for you, as you have asked, since we are not here for that. However, I will answer some of the problems and questions you actually state here.

I am afraid that the size of indexes will be really large, since every user can have his own tags

That is true: the index size could be considerable, unless you limit the number of tags a user can apply. Most sites limit tags to 10 at most, sometimes (like for questions here) 5.
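As a rough illustration of capping tags per entry, here is a minimal sketch in plain Python. It assumes tags are stored as an array on each entry document and that the cap is enforced in application code; the field names (`user_id`, `text`, `tags`, `created_at`), the `MAX_TAGS` value and the `make_entry` helper are all hypothetical, not something from the question.

```python
from datetime import datetime, timezone

MAX_TAGS = 10  # assumed cap, taken from the "10 at most" figure above


def make_entry(user_id, text, tags, max_tags=MAX_TAGS):
    """Build one entry document; tags live as an array on the entry itself."""
    # Normalize and de-duplicate tags while keeping first-seen order.
    cleaned = []
    for tag in tags:
        tag = tag.strip().lower()
        if tag and tag not in cleaned:
            cleaned.append(tag)
    if len(cleaned) > max_tags:
        raise ValueError(f"at most {max_tags} tags per entry, got {len(cleaned)}")
    return {
        "user_id": user_id,   # every query filters on this field first
        "text": text,         # the ~100-word body
        "tags": cleaned,      # bounded array, so index keys per entry are bounded
        "created_at": datetime.now(timezone.utc),
    }


entry = make_entry(42, "Buy milk on the way home", ["Shopping", "home", "shopping "])
print(entry["tags"])  # ['shopping', 'home']
```

Note that the cap only bounds the number of index keys per entry; it does not bound how many distinct tags one user can have overall.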

You might want to look into sharding that collection into smaller pieces across a cluster. Querying by these tags over a properly defined shard key is by no means slow or bad.

Even if the tags index is not your shard key, it will still perform a very fast global scatter and gather operation (a good example of query usage across large collections is here: http://docs.mongodb.org/manual/core/sharding/ ).

Sharding can also help distribute the huge size of the index across many commodity computers, allowing you to reduce costs while keeping up the flow of data.

So the first thing you want to look into is sharding and how it can work to help you. A good place to start in this respect is here: http://docs.mongodb.org/manual/core/sharding/
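To make the difference between a targeted query and a scatter and gather query concrete, here is a toy model in plain Python. It is only a sketch under loud assumptions: the shard key is `user_id`, there are four shards, and placement is a simple hash modulo. MongoDB itself assigns ranges of shard key values (chunks) to shards, so `shard_for` and `shards_to_query` are hypothetical helpers, not the real routing logic.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Map a shard key value to a shard with a stable hash (toy placement rule)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shards_to_query(query, num_shards=NUM_SHARDS):
    """Return the shards a router would have to contact for this query."""
    if "user_id" in query:
        # Shard key present: the query can be targeted at a single shard.
        return [shard_for(query["user_id"], num_shards)]
    # Shard key absent: scatter to every shard, then gather the results.
    return list(range(num_shards))


print(shards_to_query({"user_id": 42, "tags": "shopping"}))  # a single shard
print(shards_to_query({"tags": "shopping"}))                 # every shard
```

Since every query in the question includes the user, a shard key that starts with `user_id` would keep queries on the targeted path in this model.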

Thus a collection per user idea seems more suitable, but there seems to be a limit on how many collections one can create, also this approach appears to be undesired.

You also have the problem of the lock: unlike SQL, the lock is not collection level, it is in fact DB level (and don't forget the namespace restriction, which is dependent upon the size of your now "massive" indexes). Many people fall into this trap, and I am going to state now that a normal setup is fine for about 99% of cases, unless you are Facebook, but even then I think it might be fine.

Examples of similar applications: rememberthemilk, Trello, ...

I actually just had someone ask a similar style of question: "How does Trello store data in MongoDB? (Collection per board?)". If you look at the comments there, there might be some help there too.

answered by Sammaye
  • The problem with tags is that each user can have his own set of tags. Here at SO all users use the same tags. Even if there is a limit on the number of tags per document, there is not a limit on the number of tags one user can have. Thus there can be a lot of different tags in one collection. Of course I would always search first by user ID and only then by tags... – Ben Oct 19 '12 at 10:45
  • @Ben Not always, at 1k rep you can make your own tags. Even with the high selectivity of the field I don't see a massive problem. Fair enough, I haven't built your app, but immediately, without testing, I don't see a serious problem if you plan your cluster right. It will be a big index, but that is something that cannot be avoided. You could split tags off, but then you will lose context when searching for some documents, since MongoDB has no joins and NoSQL databases in general don't. – Sammaye Oct 19 '12 at 10:51
  • So you think that I should shard per UserID and add indexes on UserID, tags, and any other fields that I may need, and that should scale without problems? – Ben Oct 19 '12 at 10:57
  • @Ben It depends. If 95% of your queries are on both user_id and tag then I would shard on a compound index of the two; really this comes down to your querying pattern. What do you do most? You should really, really, REALLY think about your shard key – Sammaye Oct 19 '12 at 10:58
  • All queries are by user_id. Most of the queries are by both user id and tags. No queries are by tags only. But why should I choose more than one sharding key? If I shard based on user_id, then all data from (at least one) user will be on one server. Thus there is no need for additional tag based sharding, or is there? – Ben Oct 19 '12 at 11:01
  • @Ben I'm going to bet on a compound index on user_id and tags in that case. MongoDB can use index prefixes, so a query using only user_id should be able to use the main shard index. Not all of one user's data might be on one server: MongoDB will split the chunks as it sees fit, which means that user's data might not all be on a single server. That just gave me another idea, though: you can use tag aware sharding (v2.2) to actually accomplish that if you like :) which maybe (would need testing) would lower your index size – Sammaye Oct 19 '12 at 11:07
  • @Ben I realised that I fleshed out my idea without much explanation just then. You are right: in 90% of cases, ranging on a single collection will result in that user's data being on one shard (unless it doesn't fit, in which case it will be moved). But if you have multiple collections and want to group the user's data, then you can do a few tricks with tag aware sharding to get this working right, which in turn could lower index size. – Sammaye Oct 19 '12 at 11:55
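
The point from the comments about a query on user_id alone still using a compound (user_id, tags) index can be sketched with a sorted list in plain Python. This is a toy model, not MongoDB's index implementation: `build_index` and `find_ids` are hypothetical helpers, and the sample entries are made up. It treats the compound key purely as an ordinary index; as far as I know, MongoDB does not accept an array field such as tags inside a shard key, so sharding on user_id alone and indexing (user_id, tags) separately is the assumption here. With pymongo, the corresponding index would be created with `create_index([("user_id", 1), ("tags", 1)])`.

```python
from bisect import bisect_left


def build_index(entries):
    """Sorted (user_id, tag, entry_id) keys: one key per tag on each entry."""
    # Untagged entries get an empty-string tag so they still appear under the user.
    return sorted(
        (e["user_id"], tag, e["_id"]) for e in entries for tag in (e["tags"] or [""])
    )


def find_ids(index, user_id, tag=None):
    """Range-scan the index by the user_id prefix, optionally narrowed by tag."""
    prefix = (user_id,) if tag is None else (user_id, tag)
    start = bisect_left(index, prefix)  # first key that is >= the prefix
    ids = []
    for key in index[start:]:
        if key[: len(prefix)] != prefix:
            break  # left the contiguous key range for this prefix
        if key[2] not in ids:
            ids.append(key[2])  # an entry with several tags appears several times
    return ids


entries = [
    {"_id": 1, "user_id": 7, "tags": ["work", "urgent"]},
    {"_id": 2, "user_id": 7, "tags": ["home"]},
    {"_id": 3, "user_id": 9, "tags": ["work"]},
]
index = build_index(entries)

print(find_ids(index, 7))          # all entries of user 7, via the user_id prefix
print(find_ids(index, 7, "work"))  # only user 7's entries tagged "work"
```

Because all keys for one user are contiguous in the sorted order, both lookups touch only that user's slice of the index, however many other users and tags the collection holds.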