
Since MongoDB doesn't have transactions that can be used to ensure nothing is committed to the database unless it is consistent (non-corrupt) data, what techniques can I use to remove corrupt data and/or recover in some way if my application dies between making a write to one document and making a related write to another document?

Mat
B T
    Atomic data should always be written into exactly one document. While this isn't always possible (which may well be an indication that MongoDB is the wrong DBMS to use), proper data modelling often offers a solution. Please describe your use case and show us your data model – more often than not, a solution can be found. – Markus W Mahlberg Feb 03 '15 at 22:08
  • I appreciate the offer of help, but I'm looking for more generalized techniques. It would be great if you could write up an answer with a couple different examples, each showing a different "proper data modeling" technique (and maybe one showing where Mongo would definitely be the wrong DBMS to use). – B T Feb 03 '15 at 22:38
  • That would be very much out of scope. It would be relatively easy to write a book on data modelling for MongoDB ;) In general: write atomic data into one document. An example would be an order document for a webshop which includes prices at the given point in time. If the app dies, no harm is done, except the loss of the immediate order document, which should be easy to recreate from the basket. Without a use case and the properties of your objects, there are virtually endless possibilities. – Markus W Mahlberg Feb 04 '15 at 17:03
  • Yes, there are endless possibilities if you attempt to think of every scenario with the detail of "I have a webshop with a shopping cart blah blah blah..". But I don't think it's out of scope to ask for a list of general *types* of problems, and their solutions. – B T Feb 05 '15 at 00:35
  • For example, an example that fits part of my situation is where I have a Permission document that defines a list of users who have permission to access my main Data document. In this situation, a catastrophic failure could either leave an unused Permission doc, or leave a bad _id in the Data doc. The solution here would be to create the Permission first, then write its _id to the Data doc, because having an unused Permission is better than having a bad _id in a Data document. This is the kind of generalization I'm talking about. What kinds of scenarios are there? It doesn't need to be exhaustive. – B T Feb 05 '15 at 00:36

2 Answers


The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.

The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).

In MongoDB, try to use write-heavy, de-normalized data structures and keep the data in a single document, which improves read speed and data locality and ensures consistency. Such a data model is also easier to scale, because a single read hits only a single server, instead of having to collect data from multiple sources.
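The invoice example above can be sketched as a single de-normalized document. All names and values here are illustrative, not taken from any real schema; plain Python dicts stand in for BSON documents:

```python
# A de-normalized order document: the invoice and its line items live in one
# document, so the database sees a single write that either happens entirely
# or not at all.
order = {
    "_id": "order-1001",
    "customer": "alice",
    # Embedded line items, with prices frozen at order time.
    "items": [
        {"sku": "A-17", "qty": 2, "unit_price": 9.99},
        {"sku": "B-03", "qty": 1, "unit_price": 24.50},
    ],
    "total": round(2 * 9.99 + 24.50, 2),
}

# With a real driver this whole document would be one atomic write,
# e.g. (pymongo): db.orders.insert_one(order)
```

Because everything lives in `order`, a crash can never leave the items without their invoice or vice versa.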

However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits, or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
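The two-phase-commit pattern mentioned above can be sketched as a small state machine: a transaction document records the transfer's state, so a recovery job can find and finish (or roll back) anything a crash left half-done. This is a simplified simulation with dicts standing in for collections; the function and field names are made up for illustration:

```python
# In-memory stand-ins for an "accounts" and a "transactions" collection.
accounts = {"A": {"balance": 100, "pending": []},
            "B": {"balance": 50, "pending": []}}
transactions = {}

def start_transfer(txn_id, src, dst, amount):
    # Phase 0: record intent before touching any account.
    transactions[txn_id] = {"state": "initial", "src": src,
                            "dst": dst, "amount": amount}

def apply_transfer(txn_id):
    # Phase 1: apply to both accounts, tagging each with the txn id so a
    # recovery job can tell whether this step already ran on that account.
    t = transactions[txn_id]
    t["state"] = "pending"
    accounts[t["src"]]["balance"] -= t["amount"]
    accounts[t["src"]]["pending"].append(txn_id)
    accounts[t["dst"]]["balance"] += t["amount"]
    accounts[t["dst"]]["pending"].append(txn_id)
    t["state"] = "applied"

def commit_transfer(txn_id):
    # Phase 2: clear the pending markers and mark the transaction done.
    t = transactions[txn_id]
    accounts[t["src"]]["pending"].remove(txn_id)
    accounts[t["dst"]]["pending"].remove(txn_id)
    t["state"] = "done"

start_transfer("t1", "A", "B", 30)
apply_transfer("t1")
commit_transfer("t1")
```

A crash at any point leaves the transaction document in a known state ("initial", "pending", "applied"), which tells the recovery code exactly how far it got.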

mnemosyn
  • So are you really saying that as soon as I can't write everything into one document, I should look to MVCC and Two-Phase Commits? Surely there's something in between. – B T Feb 05 '15 at 00:30
  • Yes. Of course, you could simply accept unused documents as you describe in your last comment, but that approach hardly "removes or recovers corrupt data in some way". It is trivial to write a cleaner application to remove such documents where a reference is missing, but usually not worth the effort because you'll probably not need it more than maybe once a year... If you need it all the time, it means your app keeps crashing which is a big problem in its own regard... – mnemosyn Feb 05 '15 at 18:29
  • If catastrophic failure means that your app is hella broken until your company can scramble to correct its data, even a day of downtime can be an incredible hit to a business. I'd say it's worth it to figure out what will happen to your data in the event of this kind of failure. – B T Feb 05 '15 at 21:46

Thinking about this myself, I want to identify some categories of effects:

  1. Your operation has only one database save (saving data into one document)
  2. Your operation has two database saves (updates, inserts, or deletions), A and B
    1. They are independent
    2. B is required for A to be valid
    3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
  3. Your operation has more than two database saves

I think this is a complete list of the general possibilities. In case 1, you have no problem: one database save is atomic. In case 2.1, same thing; if they're independent, they might as well be two separate operations.

For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.

Some examples for the different cases:

1.0. You change a car document's color to 'blue'

2.1. You change the car document's color to 'red' and the driver's hair color to 'red'

2.2. You create a new engine document and add its ID to the car document

2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.

2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID

3.0. I'll leave examples of this to your imagination
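The ordering argument for case 2.2 (the engine example above) can be sketched as follows. Dicts stand in for collections, and the document fields are made up for illustration:

```python
# Case 2.2: write the referenced document (engine) first, then the reference.
# A crash between the two steps leaves only an orphan engine document, which
# wastes a little space but never leaves a dangling _id in the car.
engines = {}
cars = {"car-1": {"_id": "car-1", "color": "blue", "engine_id": None}}

def add_engine(car_id, engine_doc):
    engines[engine_doc["_id"]] = engine_doc        # step A: harmless on its own
    cars[car_id]["engine_id"] = engine_doc["_id"]  # step B: makes A meaningful

add_engine("car-1", {"_id": "eng-9", "type": "diesel"})
```

Doing the steps in the reverse order would invert the failure mode: a crash would leave a car pointing at an engine that doesn't exist, which is the worse kind of corruption.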

In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents (let's ignore the possibility of embedding the engine inside the car document for this example). It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so both have to change. But it may be valid to have no engine at all and have diesel gas. So you can add a step that keeps the whole thing valid at every point: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
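That ordered decomposition can be sketched directly, with each step leaving the documents in a valid state (field names are illustrative):

```python
# The 2.3.a decomposition: every intermediate state is valid, because
# "no engine + diesel gas" is acceptable, so we detach the engine before
# touching gasType. A crash between any two steps leaves recoverable,
# non-contradictory data.
car = {"_id": "car-1", "gas_type": "petrol", "engine_id": "eng-1"}
engines = {"eng-1": {"_id": "eng-1", "type": "petrol"}}

def convert_to_diesel(car, engines):
    engine_id = car["engine_id"]
    car["engine_id"] = None                # 1. detach: car is valid engineless
    car["gas_type"] = "diesel"             # 2. change the gas type
    engines[engine_id]["type"] = "diesel"  # 3. convert the detached engine
    car["engine_id"] = engine_id           # 4. reattach the matching engine

convert_to_diesel(car, engines)
```

Each numbered comment corresponds to one database save; none of the four intermediate states contradicts itself.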

If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" but the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
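The consistency check described for example 2.3.b might look like this sketch, again with dicts standing in for collections and one of the two repair choices picked arbitrarily:

```python
# Find cars whose "towing" reference isn't mirrored by the target car's
# "towedBy" — the signature of a crash between the two related writes.
cars = {
    "A": {"_id": "A", "towing": "B", "towedBy": None},
    "B": {"_id": "B", "towing": None, "towedBy": None},  # crash lost this write
}

def find_broken_tows(cars):
    broken = []
    for car in cars.values():
        target = car["towing"]
        if target is not None and cars[target].get("towedBy") != car["_id"]:
            broken.append(car["_id"])
    return broken

# Repair: here we complete the half-done hitch by setting the missing
# "towedBy"; clearing "towing" instead would be equally valid, and which is
# right depends on the application.
for car_id in find_broken_tows(cars):
    cars[cars[car_id]["towing"]]["towedBy"] = car_id
```

After the repair pass, the scan comes back empty, which is exactly the invariant the check encodes.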

In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).

If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

B T
  • That handles only inserts. Take updates into account and it becomes way more complicated, because the data might be invalid but look valid. – mnemosyn Feb 05 '15 at 22:48
  • My categories include updates. A "save" is any update, insert, or deletion. What's an example that's not covered by those categories? – B T Feb 05 '15 at 22:52
  • Your operations are not *atomic*. There might be a second writer. Thread A updates item 1. Now thread B looks at item 1, identifying it as being in a state where it can perform its own update on related item 2. Thread B overwrites that change immediately after (or does a consistency check and fails because thread B was faster). This leads to byzantine bugs. You are reducing all faults to network partitions / machine crashes, but wrong code and multithreading is a much more common source of trouble. – mnemosyn Feb 05 '15 at 22:57
  • Ok. My question has nothing to do with "wrong code and multithreading", so those things are off topic. My question is *specifically* about problems stemming from machine crashes and other mid-operation application faults. – B T Feb 05 '15 at 23:00