
I'm working on a project where we periodically collect large quantities of e-mail via IMAP or POP, perform analysis on it (such as clustering into conversations, extracting important sentences etc.), and then present views via the web to the end user.

The main view will be a Facebook-like profile page for each contact, showing the 20 or so most recent conversations they have had, drawn from the e-mail we capture.

For us, it's important to be able to retrieve the profile page and recent 20 items frequently and quickly. We may also be frequently inserting recent e-mails into this feed. For this, document storage and MongoDB's low-cost atomic writes seem pretty attractive.
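That "recent 20" feed can in fact be maintained with a single atomic write per incoming e-mail. A minimal sketch, assuming the pymongo driver (`$slice` alongside `$push` requires MongoDB 2.4+; collection and field names here are illustrative, not from the original post):

```python
# Hypothetical profile document: one per contact, with a capped
# "recent_conversations" array maintained atomically on each new e-mail.
new_conversation = {"subject": "Quarterly report", "ts": "2011-02-04T00:00:00Z"}

# Update spec to pass to e.g. profiles.update_one({"_id": contact_id}, update):
# $push with $each/$slice appends the new item and trims the array to the
# 20 newest entries in one atomic operation.
update = {
    "$push": {
        "recent_conversations": {
            "$each": [new_conversation],
            "$slice": -20,  # keep only the last 20 elements
        }
    }
}
```

Because the whole update is one document-level operation, no read-modify-write cycle (and no application-side locking) is needed to keep the feed capped.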

However we'll also have a LARGE volume of old e-mail conversations that won't be frequently accessed (since they won't appear in the most recent 20 items, folks will only see them if they search for them, which will be relatively rare). Furthermore, the size of this data will grow more quickly than the contact store over time.

From what I've read, MongoDB seems to more or less require the entire data set to remain in RAM, and the only way to work around this is to use virtual memory, which can carry a significant overhead. This could get quite nasty if Mongo can't differentiate between the volatile data (profiles/feeds) and the non-volatile data (old emails) — and since it seems to delegate virtual memory allocation to the OS, I don't see how this would be possible for Mongo to do.

It would seem that the only choices are to either (a) buy enough RAM to store everything, which is fine for the volatile data, but hardly cost efficient for capturing TB of e-mails, or (b) use virtual memory and see reads/writes on our volatile data slow to a crawl.

Is this correct, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?

Andrew J

4 Answers


MongoDB does not "require the entire data set to remain in RAM". See http://www.mongodb.org/display/DOCS/Caching for an explanation as to why/how it uses virtual memory the way it does.

It would be fine for this application. If your sorting and filtering were more complex you might, for example, want to use a Map-Reduce operation to create a collection that's "display ready" but for a simple date ordered set the existing indexes will work just fine.
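The "simple date ordered set" case the answer describes might look like this, assuming the pymongo driver (index and field names are illustrative): a compound index lets the 20-most-recent query be served from the index, so cold documents never need to be paged in.

```python
# Assumed pymongo calls (shown commented out; they need a live server):
# conversations.create_index([("contact_id", 1), ("date", -1)])
# cursor = (conversations.find({"contact_id": cid})
#                        .sort("date", -1)
#                        .limit(20))

# The shape of that query as plain data:
query = {"contact_id": "contact-123"}
sort_spec = [("date", -1)]  # -1 = descending, i.e. newest first
limit = 20
```

The index direction matches the sort direction, so MongoDB can walk the index in order and stop after 20 entries rather than sorting in memory.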

Ian Mercer
  • My question is how does MongoDB's caching (which in reality is the operating system's caching) know which data to keep in physical RAM vs virtual RAM. Can I be sure that, for example, on a 500GB dataset on a machine with 20GB RAM, the volatile data like contacts is preferentially kept in the 20GB of RAM, and the non-volatile data that hardly ever gets read (old emails) is the first thing to move to disk? – Andrew J Feb 04 '11 at 03:06
  • 1
    There is only physical RAM, everything else is called "disk". There is no such thing as "virtual RAM". As to how virtual memory works you really don't need to know. To begin to understand it you'd need to look at how MongoDB handles indexes as well as how it handles collections. Splitting a collection into 'old' and 'new' is going to be a headache if you ever want to search or sort across it. MongoDB isn't going to load the whole thing into memory - don't worry! Just benchmark it, you'll find it's bindingly fast! – Ian Mercer Feb 04 '11 at 06:20

MongoDB uses mmap to map documents into virtual memory (not physical RAM). Mongo does not require the entire dataset to be in RAM but you will want your 'working set' in memory (working set should be a subset of your entire dataset).

If you want to avoid mapping large amounts of email into virtual memory you could have your profile document include an array of ObjectIds that refer to the emails stored in a separate collection.
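The two-collection split this answer suggests could be sketched like this (all names and the example ObjectId are illustrative, not from the original post): the hot "profiles" collection holds only lightweight references, while full message bodies live in a cold "emails" collection that can stay on disk.

```python
# In real MongoDB this would be a bson.ObjectId; a plain string stands in here.
email_id = "4d4b6ef5a1b2c3d4e5f60718"

profile = {
    "_id": "contact-123",
    "name": "Jane Doe",
    "recent_email_ids": [email_id],  # references, not embedded documents
}
email = {"_id": email_id, "date": "2011-02-03", "body": "full message text"}

# Rendering the profile page touches only `profile`; a body is fetched on
# demand, e.g. emails.find_one({"_id": profile["recent_email_ids"][0]})
```

The trade-off is an extra round trip when a body is actually needed, in exchange for keeping the frequently-mapped profile documents small.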

Bernie Hackett
  • Thanks for this answer. Is it fair to assume that collections are the best level of granularity for managing caching (i.e. one collection for volatile data which will be read/written frequently, and a separate collection that generally stays on disk)? – Andrew J Feb 04 '11 at 02:08

@Andrew J Typically you need enough RAM to hold your working set; this is as true for MongoDB as it is for an RDBMS. So if you want to hold the last 20 emails for all users without going to disk, then you need that much memory. If this exceeds the memory on a single system, you can use MongoDB's sharding feature to spread data across multiple machines, thereby aggregating the memory, CPU and I/O bandwidth of the machines in the cluster.

@mP MongoDB allows you as the application developer to specify the durability of your writes, from a single node in memory to multiple nodes on disk. The choice is yours, depending on what your needs are and how critical the data is; not all data is created equal. In addition, in MongoDB 1.8 you can specify --dur, which writes a journal file for all writes. This further improves the durability of writes and speeds up recovery if there is a crash.
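The durability spectrum described above can be written down concretely. A sketch using the modern write-concern vocabulary (option spellings vary by driver and era; these are the pymongo-style names, not anything from the original answer):

```python
# Each level trades write latency for a stronger durability guarantee.
durability_levels = [
    {"w": 0},             # fire-and-forget: fastest, no acknowledgement
    {"w": 1},             # acknowledged in memory by a single node
    {"w": 1, "j": True},  # journaled to disk first (the --dur journal, 1.8+)
    {"w": "majority"},    # acknowledged by a majority of replica-set nodes
]
```

A driver would pass one of these as the write concern on a collection or an individual operation, picking per data set how critical the data is.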

user602502

And what happens to all the stuff Mongo had in memory if your computer crashes? I'm guessing that it has no logs, so the answer is probably: bad luck.

mP.
  • 'guessing' isn't going to help anyone as an answer. – Ian Mercer Feb 04 '11 at 01:46
  • 1
  • Mongo persists to disk, but not on each write (to maintain high write performance). There is a possibility of small data loss between when the write is committed to memory and then to disk, which has been well discussed and is not a concern here. However the entire DB is kept in RAM as well, which might be physical or virtual RAM depending on how big the data store is. My question is about how smart Mongo is at figuring out which part of that data should be in physical RAM vs virtual RAM. – Andrew J Feb 04 '11 at 03:08
  • 1
    MongoDB 1.7.5 adds durability via a journal. This feature will be in the stable 1.8 release. Also, when you write to MongoDB, you can require that it fsync to disk. So, you get to choose the performance versus durability trade-off. An understandable criticism, though, is that it defaults to not writing to disk (or even waiting on a response from the server), which often takes people by surprise if they haven't read the documentation. – Robert Stewart Feb 04 '11 at 06:43
  • The entire database is not kept in RAM. MongoDB uses memory mapped files, which is different. Mongo tries hard to at least keep indexes in RAM, though. – Robert Stewart Feb 04 '11 at 06:44
  • 1
    I never presented my comment as a fact - i was only raising a q that one needs to know the definitive answer too. – mP. Feb 04 '11 at 21:40