
I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.

So this is what I understand of the process so far; please correct me where I'm wrong:

  • A client connects to mongod and performs a write. The write is stored in the socket buffer.
  • When Mongo is available (not sure what "available" means at this point), data is written to the journal?
  • The MongoDB docs then say that every 60 seconds writes are flushed from the journal onto disk. By this I can only assume they mean written to the primary's data files and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
  • Some time later, secondaries pull data from the primary or their sync source and update their oplog and databases. The docs seem very vague about when exactly this happens and what delays it.

I'm also wondering, if journaling were disabled (I understand that's a really bad idea), at what point do the oplog and database get updated?

Lastly, I'm a bit stumped about the points in this process at which write locks get taken. Is this just when the database and oplog are updated, or at other times too?

Thanks to anyone who can shed some light on this or point me to some reading material.

Simon

1 Answer


Here is what happens, as far as I understand it. I've simplified a bit, but it should make clear how it works.

  1. A client connects to mongod. No writes are done so far, and no connection is torn down, because what happens next really depends on the write concern. Let's assume we go with the (at the time of this writing) default, "acknowledged".
  2. The client sends its write operation. Here is where I am really not sure. Either after this step or the next one, the acknowledgement is sent to the driver.
  3. The write operation is run through the query optimizer. It is here that the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the last step. If I had to bet, I'd say it is after this one.
  4. The output of the query optimizer is then applied to the data in memory: specifically, to the data of the memory-mapped data files, to the memory-mapped oplog, and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the according data is mapped into memory to answer the query. The oplog is read from memory too, if present.
  5. Every 100 ms, in general, the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now (see the sketch after this list).
  6. Every syncDelay seconds, the current state of the memory-mapped files is synced to disk. I think the journal is then truncated to the entries which weren't yet applied to the data, but I am not too sure of that, since it should basically never happen that data in the journal isn't yet applied to the current data.
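
To make the write-concern distinction concrete, here is a minimal sketch using the Python driver (PyMongo); the connection string and collection names are placeholders. `w=1` corresponds to "acknowledged" (steps 2–3 above), while `j=True` corresponds to "journaled" (step 5):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")  # placeholder address
db = client.test

# "acknowledged" (w=1): the call returns once the primary has accepted
# the write, i.e. after it passed the optimizer and was applied to the
# memory-mapped files.
acked = db.get_collection("demo", write_concern=WriteConcern(w=1))
acked.insert_one({"msg": "acknowledged once applied in memory"})

# "journaled" (j=True): the call returns only after the write has been
# committed to the on-disk journal (the ~100 ms sync in step 5).
journaled = db.get_collection("demo", write_concern=WriteConcern(w=1, j=True))
journaled.insert_one({"msg": "acknowledged once in the journal"})
```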

If you have read carefully, you noticed that the data is ready for the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to its own data in the memory-mapped files and synced to disk the same way as on the primary.
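
If you want to see those oplog entries yourself, the oplog is just a capped collection in the `local` database. A hedged sketch (the `oplog.rs` collection only exists on replica set members; standalone nodes have no oplog):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder address

# On a replica set member, the oplog lives in local.oplog.rs; sorting by
# $natural descending yields the most recently applied operation.
oplog = client.local["oplog.rs"]
last = next(oplog.find().sort("$natural", -1).limit(1))
print(last["ts"], last["op"], last.get("ns"))
```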

Some things to note: as soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the data files, both the data files and the oplog can be restored from their last state in the data files plus the journal. In general, the maximum data loss you can have is the operations received after the last journal commit, 50 ms at the median.
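
Both intervals involved here are tunable. As a sketch, and assuming an MMAPv1-era mongod where `journalCommitInterval` and `syncdelay` are exposed as server parameters (check your version's docs), you could inspect and adjust them like this:

```python
from pymongo import MongoClient

admin = MongoClient("mongodb://localhost:27017").admin  # placeholder address

# Read the journal commit interval (ms) and data-file sync delay (s).
params = admin.command("getParameter", 1,
                       journalCommitInterval=1, syncdelay=1)
print(params)

# Lowering the commit interval shrinks the worst-case loss window at the
# cost of more frequent journal syncs (docs of that era allowed 2-300 ms).
admin.command("setParameter", 1, journalCommitInterval=50)
```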

As for the locks: if you have read carefully, you'll have noticed that there aren't locks imposed at the database level when the data is synced to disk. Write locks may be created in order to ensure that only one thread at any given point in time modifies a given document. There are other write locks possible, but in general, they should be rather rare.
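
One way to observe this is the `locks` section of `serverStatus`, which reports time spent acquiring and holding locks; notably, there is no entry tied to the background disk syncs. A small sketch (the exact layout of the section varies by server version):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder address

# serverStatus includes lock statistics; on 2014-era (2.6) servers the
# "locks" document is keyed per database.
status = client.admin.command("serverStatus")
for name, stats in status.get("locks", {}).items():
    print(name, stats)
```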

Write locks on the filesystem layer are created once, though only implicitly, IIRC. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those data files while a valid lock exists. And you shouldn't either ;)
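
That lock file is literally a file named `mongod.lock` at the top of the dbpath. A trivial sketch to check for it (`/data/db` is just the default dbpath; adjust for your setup):

```python
import os

dbpath = "/data/db"  # default dbpath; an assumption, adjust as needed

# While mongod runs (or after an unclean shutdown) the lock file is
# non-empty; other mongod instances refuse to touch these data files.
lock_file = os.path.join(dbpath, "mongod.lock")
if os.path.exists(lock_file) and os.path.getsize(lock_file) > 0:
    print("data files are (or were) in use by a mongod instance")
```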

Hope this helps.

Markus W Mahlberg
  • Thanks for the informative response; I've been too busy to even reply. It's good to know I'm not the only one who doesn't completely understand the process. I think I'm understanding you, but to confirm: are you saying all data is originally written to memory, and then every 100 ms (or whatever it's configured to) the journal is written to disk? Then every syncDelay seconds the oplog and database data is written to disk? –  Sep 17 '14 at 14:20
  • That does make sense; my one concern with this, however, is: if the data is dropped from memory before it's synced to disk, would it mean that the changes would not be synced? The journal would still be on disk (syncing every 100 ms), so maybe it is replayed if missing from memory. Still, if the memory is dropped, data could be missing until the journal is replayed at the syncDelay interval. Do you know if Mongo has a way to combat this? Thanks for the tips on the locks too; that has definitely cleared things up. –  Sep 17 '14 at 14:21
  • There is no such thing as 100% security. If the data is only in memory and the server fails, the data _will_ be lost. If you want something reducing the probability of that happening, you have to set a write concern of w>1 or majority and have geographically distributed replica set members (see the sketch after these comments). – Markus W Mahlberg Sep 17 '14 at 15:42
  • There is a great set of illustrations with explanation on the [MongoDB blog](http://blog.mongodb.org/post/33700094220/how-mongodbs-journaling-works) for how the journaling process works. – wdberkeley Sep 18 '14 at 19:03
  • I wasn't really referring to data durability, in the sense of losing data in memory. My line of thought was more aimed towards the possibility of data being removed from memory and instead placed in the on-disk journal, meaning the data would not appear on reads until the disk journal is flushed. Thanks for the link, I read it on Kristina Chodorow's blog the day I posted this :) –  Sep 23 '14 at 15:50
  • Data is not read from disk, but from a subset held in memory (simplified). So you'll always get the current data, no matter if it's flushed to disk yet or not. – Markus W Mahlberg Sep 23 '14 at 19:50
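
For reference, here is a minimal sketch of the `w="majority"` write concern Markus mentions in the comments, again using PyMongo with placeholder host names:

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Placeholder replica set address; replace with your members.
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# The insert returns only once a majority of members have the write, so
# losing any single node's memory no longer loses the data.
coll = client.test.get_collection("demo",
                                  write_concern=WriteConcern(w="majority"))
coll.insert_one({"msg": "durable across single-node failures"})
```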