20

I am designing a system with MongoDB (64-bit version) to handle a large number of users (around 100,000), where each user will have a large amount of data (around 1 million records).

What is the best strategy of design?

  1. Dump all records in a single collection

  2. Have a collection for each user

  3. Have a database for each user

Many Thanks,

DafaDil
  • From the database architecture viewpoint I would recommend using a single collection, but I am not sure they still scale so well when you have hundreds of *billions* of records in them. – Philipp Dec 18 '12 at 17:07

3 Answers

19

So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).

The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers, which are presented as a single logical unit via the mongo client.

Therefore the answer to your question is: put all your records in a single sharded collection.

The number of shards required and the configuration of the cluster depend on the size of the data and other factors, such as the quantity and distribution of reads and writes. The answers to those questions are very specific to your unique situation, so I won't attempt to guess them.

I'd probably start by deciding how many shards you have the time and machines available to set up, and testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster.
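As a minimal sketch of that setup in the mongo shell (the database name `mydb`, collection name `records`, and `userId` field are illustrative assumptions, not details from the question):

    // Run against a mongos router once the shard servers have been added.
    sh.enableSharding("mydb")

    use mydb
    // An index on the shard key is needed before sharding a non-empty collection.
    db.records.ensureIndex({ userId: 1 })

    // Shard on userId so each user's records are kept together in chunks.
    sh.shardCollection("mydb.records", { userId: 1 })

With userId as the shard key, queries that include it can be routed to a single shard rather than hitting the whole cluster.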

chrisbunney
  • While sharding architecture is definitely relevant in this scenario, your post doesn't address the question of the OP, which was about whether to use one collection, multiple collections, or multiple databases. – Philipp Dec 18 '12 at 17:43
  • Ah yes, options 2 and 3 were so counter-intuitive for me that I forgot to make it explicit that you should put it into a single collection and shard. – chrisbunney Dec 18 '12 at 19:25
  • @chrisbunney what are your two pennies on using a pattern of "databases or collections for every user" for the sole purpose of security and simplified access control management? – kommradHomer Feb 04 '15 at 14:55
9

So you are looking at around 100 billion detail records overall (1 million each for 100K users)?

What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling normally means scaling huge single collections of data across many (many) servers in a huge cluster.

So if you use a single collection per type of common data (i.e. one collection called user and one called detail), you are already suiting MongoDB's core purpose and build.
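For illustration, a minimal sketch of that two-collection layout in the mongo shell (the field names are assumptions, not taken from the question):

    // One document per user in the user collection.
    var aliceId = ObjectId()
    db.user.insert({ _id: aliceId, name: "alice" })

    // One document per record in the detail collection,
    // each pointing back at its owner via userId.
    db.detail.insert({ userId: aliceId, payload: "..." })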

MongoDB, as others have mentioned, is not so good at scaling vertically across many collections. It has an nssize limit to begin with, and even though roughly 12K collections is the initial estimate, in reality, because indexes consume namespaces too, you can end up with as few as 5K collections in your database.

So a collection per user is not feasible at all. It would be using MongoDB against its core principles.

Having a database per user involves the same problems as having a collection per user, and maybe more.

I have personally never seen someone scale MongoDB to the billions, or even close to the 100s of billions (or maybe beyond), on an optimised set-up; however, I do not see why it cannot be done. After all, Facebook is able to make MySQL scale into the 100s of billions of records for them (across 32K+ shards), and the sharding concept is similar between the two databases.

So the theory and possibility of doing this are there. It is all about choosing the right schema, shard concept and key (and servers and network, etc.).

If you were to witness problems, you could split archive collections or deleted items away from the main collection, but I think that is overkill. Instead, you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time, and ensure that this data is always hot; that way, queries that don't do a global scatter-gather operation should be quite fast.
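To make that targeted-versus-scatter distinction concrete, assuming the detail collection is sharded on userId as discussed (names assumed as before):

    // Targeted: includes the shard key, so mongos routes it to one shard.
    db.detail.find({ userId: aliceId })

    // Scatter-gather: no shard key, so every shard has to be queried.
    db.detail.find({ payload: "..." })

    // explain() on a sharded cluster reports which shards were involved.
    db.detail.find({ userId: aliceId }).explain()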

Sammaye
4

About a collection for each user:

With the default configuration, MongoDB is limited to around 12k collections. You can increase this with --nssize, but it is not unlimited, and indexes count against the same limit, since each collection and each index consumes a namespace (check the "namespaces" concept in the MongoDB documentation).
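You can check how close you are to that limit from the mongo shell; on the MMAPv1 storage engine of that era, each collection and each index occupies one entry in system.namespaces:

    // Count the namespaces (collections + indexes) currently in use
    // in the current database; this is what the nssize limit caps.
    db.system.namespaces.count()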

About a database for each user:

From a modelling point of view, that is very odd. Technically, there is no limit in Mongo, but you will probably hit a file descriptor limit (imposed by your OS/settings).

So, as @Rohit says, the last two options are not good. Maybe you should explain more about your case. Maybe you can split users into different collections (e.g. one for each first letter of the name, or one for each department of the company...); a sketch of that bucketing idea follows below. And, of course, use sharding.
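As a purely illustrative sketch of that bucketing idea in the mongo shell (the helper and collection names are made up, not an established pattern):

    // Hypothetical helper: route a user to one of 26 collections
    // keyed on the first letter of their name.
    function collectionForUser(name) {
        return db.getCollection("users_" + name.charAt(0).toLowerCase());
    }

    collectionForUser("Alice").insert({ name: "Alice" })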

Edit: maybe MongoDB is not the best database for your use case.

Aurélien B