36

I am new to NoSQL world and thinking of replacing my MS Sql Server database to MongoDB. My application (written in .Net C#) interacts with IP Cameras and records meta data for each image coming from Camera, into MS SQL Database. On average, i am inserting about 86400 records per day for each camera and in current database schema I have created separate table for separate Camera images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. Single image record consists of simple metadata info. like AutoId, FilePath, CreationDate. To add more details to this, my application initiates separate process (.exe) for each camera and each process inserts 1 record per second in relative table in database.

I need suggestions from (MongoDB) experts on following concerns:

  1. to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?

  2. What should be the specs of server (CPU, RAM, Disk)? any suggestion?

  3. Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?

  4. Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images? I am thinking on this with respect to splitting reads and writes on separate databases. Because all read requests might be served by second database and writes to first one. Will it benefit or not? If yes then any idea to ensure that both databases are synced always.

Any other suggestions are welcomed please.

A-Sharabiani
  • 17,750
  • 17
  • 113
  • 128
theGeekster
  • 6,081
  • 12
  • 35
  • 47
  • http://stackoverflow.com/questions/1476295/when-to-use-mongodb-or-other-document-oriented-database-systems – Pranav 웃 Nov 02 '12 at 07:04
  • http://developer.olery.com/blog/goodbye-mongodb-hello-postgresql/ –  Mar 13 '15 at 11:56
  • 1
    Surprised to see a subjective question like this not voted for closure. but question though. – Ahsan Aug 26 '15 at 05:12

3 Answers3

30

I am myself a starter on NoSQL databases. So I am answering this at the expense of potential down votes but it will be a great learning experience for me.

Before trying my best to answer your questions I should say that if MS SQL Server is working well for you then stick with it. You have not mentioned any valid reason WHY you want to use MongoDB except the fact that you learnt about it as a document oriented db. Moreover I see that you have almost the same set of meta-data you are capturing for each camera i.e. your schema is dynamic.

  • to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?

MongoDB being a document oriented db, is good at querying within an aggregate (you call it document). Since you already are storing each camera's data in its own table, in MongoDB you will have a separate collection created for each camera. Here is how you perform date range queries.

  • What should be the specs of server (CPU, RAM, Disk)? any suggestion?

All NoSQL data bases are built to scale-out on commodity hardware. But by the way you have asked the question, you might be thinking of improving performance by scaling-up. You can start with a reasonable machine and as the load increases, you can keep adding more servers (scaling-out). You no need to plan and buy a high end server.

  • Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?

MongoDB locks the entire db for a single write (but yields for other operations) and is meant for systems which have more reads than writes. So this depends upon how your system is. There are multiple ways of sharding and should be domain specific. A generic answer is not possible. However some examples can be given like sharding by geography, by branches etc.

Also read A plain english introduction to CAP Theorem

Updated with answer to the comment on sharding

According to their documentation, You should consider deploying a sharded cluster, if:

  • your data set approaches or exceeds the storage capacity of a single node in your system.
  • the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
  • your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention.

So based upon the last point yes. The auto-sharding feature is built to scale writes. In that case, you have a write lock per shard, not per database. But mine is a theoretical answer. I suggest you take consultation from 10gen.com group.

Community
  • 1
  • 1
Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
  • Good answer still, the only thing I would advise is that MongoDB, depending upon index size (etc) has a max limit of collections per DB on it's default ns size (which is recommended to keep to) of about 18K collections and since sharding works well on single collection basis it seems a good idea to promote the use of a single collection for all cameras. However +1 you bring up some good point like the lock, etc. – Sammaye Nov 02 '12 at 08:47
  • Can you tell me, if i can avoid global r/w locking by using separate Shard for each camera. Means having almost 100 shards on same machine? I am asking because i have just read on web that Mongo has separate lock for each shard. – theGeekster Nov 02 '12 at 11:15
4

to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)?

This quiestion is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL) I would say they are both equally as good, if done right.

Also:

What should be the specs of server (CPU, RAM, Disk)? any suggestion?

Depends on too many variables that only you know, however a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question and it will come down to your testing.

As for a schema I would go for a document of the structure:

{
    _id: {},
    camera_name: "my awesome camera",
    images: [
        { 
            url: "http://I_like_S3_here.amazons3.com/my_image.png" ,
            // All your other fields per image
        }
    ]
}

This should be quite easy to mantain and update so long as you are not embedding much deeper since then it could become a bit of pain, however, that depends upon your queries.

Not only that but this should be good for sharding since you have all the data you need in one document, if you were to shard on _id you could probably get the perfect setup here.

Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?

Possibly, many people assume they need to shard when in reality they just need to be more intelligent in how they design the database. MongoDB is very free form so there are a lot of ways to do it wrong, but that being said, there are also a lot of ways of dong it right. I personally would keep sharding in mind. Replication can be very useful too.

Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?

Even though MongoDBs write lock is on DB level (currently) I would say: No. The right document structure and the right sharding/replication (if needed) should be able to handle this in a single document based collection(s) under a single DB. Not only that but you can direct writes and reads within a cluster to certain servers so as to create a concurrency situation between certain machines in your cluster. I would promote the correct usage of MongoDBs concurrency features over DB separation.

Edit

After reading the question again I omitted from my solution that you are inserting 80k+ images for each camera a day. As such instead of the embedded option I would actually make a row per image in a collection called images and then a camera collection and query the two like you would in SQL.

Sharding the images collection should be just as easy on camera_id.

Also make sure you take you working set into consideration with your server.

Sammaye
  • 43,242
  • 7
  • 104
  • 146
  • Does MongoDB use separate lock for each shard? and is there a limit on number of shards for single database on single machine? – theGeekster Nov 02 '12 at 12:03
  • @theGeekster I believe this answers your questions: http://www.mongodb.org/display/DOCS/How+does+concurrency+work the MongoS is a shard while the Mongod is the database on that shard. So it is a lock per shard, but that shard has db lock. As far I know there is only a limit of replicas, not shards: http://stackoverflow.com/questions/8686420/what-is-the-maximum-number-of-shards-mongodb-can-have – Sammaye Nov 02 '12 at 12:13
  • Can you please explain a little bit as you said: "_So it is a lock per shard, but that shard has db lock_" ? Also your suggestion for sharding on **camera_id**, does that mean a single collection of images will be split by MongoDB on different shards (1 shard for 1 camera)? Is this better to 1 collection for 1 camera and still use camera_id as shard key? (because number of images for 1 camera will be quite high like 30,00,000). – theGeekster Nov 04 '12 at 01:54
  • @theGeekster It should mean one shard per camera, however MongoDB can move chunks to their best fit position depending on space on servers etc. As for the lock it means that each mongod that makes up a shard cluster cannot respond to other write locks (read lock is concurrent) while they are used however each mongod within that cluster is inpedendant so you can write to one computer and the other at the same time. This is why it is key to have a shard key that balances writes across all shards (if that becomes mega important) – Sammaye Nov 04 '12 at 12:18
  • @theGeekster Though that being about my previous on sharding camera_id, you can also manage MongoDBs sharding yourself, so as I said MongoDB is very free form, however you should focus on the beginner stuff first instead of diving straight into making MongoDB split chunks how you want manually. – Sammaye Nov 04 '12 at 12:59
3

to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?

MongoDB can do this. For better performance, you can set an index on your time field.

What should be the specs of server (CPU, RAM, Disk)? any suggestion?

I think RAM and Disk would be important.

  • If you don't want to do sharding to scale out, you should consider a larger size of disk so you can store all your data in it.
  • Your hot data should can fit into your RAM. If not, then you should consider a larger RAM because the performance of MongoDB mainly depends on RAM.

Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?

I don't know many cameras do you have, even 1000 inserts/second with total 1000 cameras should still be easy to MongoDB. If you are concerning insert performance, I don't think you need to do sharding(Except the data size are too big that you have to separate them into several machines).

Another problem is the read frequency of your application. It it is very high, then you can consider sharding or replication here. And you can use (timestamp + camera_id) as your sharding key if your query only on one camera in a time range.

Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?

You can separate the table into two collections(archive and current). And set index only on archive if you only query date on archive. Without the overhead of index creation, the current collection should benefit with insert.

And you can write a daily program to dump the current data into archive.

Staeff
  • 4,994
  • 6
  • 34
  • 58
Chien-Wei Huang
  • 1,773
  • 1
  • 17
  • 27
  • Thanks for the great reply, actually i was thinking on the same lines to split the data into two groups (today/current and archive/old). And my read frequency is on average 1 per 2-seconds. 1: For inserts i can assume that MongoDB v2.2 will have no issues at the rate of 1000/second, but what it will be doing with READ requests at the same time (will it not lock the reads)? How can i avoid this read locking when writes are always occurring. 2: What do you think if i make two separate databases/shards (current and archive) to have separate locks? Then always write to current shard.. – theGeekster Nov 03 '12 at 05:53
  • 1
    ... then always write to current shard (with index on datetime) and at the day end move all the data of current shard to archive. The reads for current day will be served by current shard and for previous days will be served by archive shard. Here i can think of few concerns as well, 1) reads for the current days are still vulnerable to r/w locks, 2) have two shards on same machine might rise some Memory conflicts among both for writes and reads and indexes, 3) at day end if moving the data from current shard to archive take few minutes, should reads during that time goto current or archive? – theGeekster Nov 03 '12 at 06:09