
I have been playing with MongoDB since yesterday and absolutely love it. I am trying to import a lot of data (2 billion rows) and index it, but it doesn't seem to be using the 8 cores that my system has, and the import is going at ordinary rates (60,000 records/sec). I can only imagine how long it might take to index two columns in this collection. Are there any MongoDB-type databases that exploit the multicore nature of CPUs?

Legend
  • I don't think you can efficiently deal with data insertion from many cores at once. – zneak Oct 03 '10 at 01:06
  • @zneak: I was actually thinking of inserting data into multiple collections (tables); I just wasn't really sure if splitting up that way would be beneficial. I mean, instead of adding an index on one large table, I would create many small tables and then create an index on each of them. Then, from the front-end, I would query all the tables to get the required values (see the sketch after these comments). Do you have any suggestions on this approach? – Legend Oct 03 '10 at 01:12
  • I'm no NoSQL expert. All I know is that you can't safely modify data from multiple threads without locking it, hence my comment. :/ Sorry I can't help. (Though, I also know that you should fill your collections, then create the indexes. Otherwise, you waste a lot of time reindexing the data at each insert.) – zneak Oct 03 '10 at 04:41
  • @zneak: Yes, thank you for the pointers. I am indexing it now. I let it run and it has finished inserting about 1.5 billion :) Not sure how long building the indexes will take though... Thanks again for your help. – Legend Oct 03 '10 at 04:45
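
For illustration, here is a minimal sketch of the "many collections, index after the load" idea from the comments, assuming pymongo. The database name, collection names, `make_docs` generator, and batch size are hypothetical placeholders, not the actual import code:

```python
# Sketch: one worker process per core, each writing to its own collection,
# with the index built only after that collection's bulk load finishes.
# Assumes pymongo; make_docs() is a hypothetical stand-in for the real data source.
from multiprocessing import Pool
from pymongo import ASCENDING, MongoClient

NUM_WORKERS = 8        # one per core
BATCH_SIZE = 10_000    # insert in batches rather than one document at a time

def make_docs(worker_id):
    """Hypothetical document generator; replace with the real reader."""
    for i in range(1_000_000):
        yield {"user_id": worker_id * 1_000_000 + i, "value": i}

def load_chunk(worker_id):
    # Each worker opens its own connection (don't share a client across forks).
    client = MongoClient("localhost", 27017)
    coll = client.mydb[f"events_{worker_id}"]   # one collection per worker
    batch = []
    for doc in make_docs(worker_id):
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            coll.insert_many(batch, ordered=False)
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)
    # Create the secondary index only after all inserts are done.
    coll.create_index([("user_id", ASCENDING)])
    client.close()

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        pool.map(load_chunk, range(NUM_WORKERS))
```

Querying then means fanning the same query out to every `events_N` collection and merging the results in the application, which is essentially a hand-rolled version of what sharding does for you.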

1 Answer


If MongoDB has an Achilles' heel, it's the fact that it only supports single-threaded writes and single-threaded map-reduces.

As always, there are trade-offs here. Single-threaded writes are the simplest way to avoid locking problems and minimize overhead. In the same fashion, multi-threaded map-reduces would mean heavy locking of your data, so single-threaded map-reduces on a production system are probably easier and safer.

However, you're not without tools here. MongoDB will provide one write thread to each instance. So if you shard MongoDB, then you'll get one write thread for each shard.
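
As an example, sharding a collection comes down to enabling sharding on the database and picking a shard key. A hedged sketch using pymongo's admin commands follows; the database name, collection name, and shard key are hypothetical, and each shard's mongod then applies writes with its own write thread:

```python
# Sketch: enable sharding on a hypothetical database/collection.
# These are the admin commands behind the sh.enableSharding / sh.shardCollection
# shell helpers; assumes pymongo and a mongos router on localhost:27017.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # connect to mongos, not a single mongod

# Allow collections in "mydb" to be distributed across shards.
client.admin.command("enableSharding", "mydb")

# Shard the collection on user_id; writes then spread across the shards,
# and each shard applies them with its own (single) write thread.
client.admin.command("shardCollection", "mydb.events", key={"user_id": 1})
```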

If you want multiple indexes on 2 billion rows, you'll want to look at sharding anyway. Some quick math: the ObjectId (MongoDB's default _id) is 12 bytes, so the index on _id alone will be 2B * 12 bytes = 22GB+. If you then add two more indexes (even on plain 4-byte integers), that's roughly 7.5GB for each.

So at 2B rows, you're talking about over 37GB in indexes (minimum). On most 8-core servers, that means you won't even be able to keep your indexes in memory, let alone any of the data.
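
Spelling that arithmetic out (key bytes only, using GiB = 2^30, which matches the 22GB / 7.5GB / 37GB figures above; real indexes carry B-tree overhead on top of this):

```python
# Back-of-the-envelope index sizing for 2 billion documents (key bytes only).
GiB = 2 ** 30
docs = 2_000_000_000

oid_index = docs * 12 / GiB        # _id index: 12-byte ObjectIds -> ~22.4 GiB
int_index = docs * 4 / GiB         # one 4-byte integer index     -> ~7.5 GiB

total = oid_index + 2 * int_index  # _id plus two integer indexes -> ~37.3 GiB
print(f"{oid_index:.1f} + 2 * {int_index:.1f} = {total:.1f} GiB")
```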

So if you want serious performance here, you'll want to start looking at sharding, just based on these general numbers. FWIW, MySQL would be no more adept at handling 2B records; with that much data, you really want multiple servers to keep up with the load.

Gates VP
  • I guess sharding is the only way to go this time. Thank you for your suggestions. I am indexing in both MySQL and MongoDB. Will use whichever completes first :) Also a good benchmarking exercise... – Legend Oct 03 '10 at 06:32