Can you share your thoughts on how you would implement data versioning in MongoDB? (I've asked a similar question regarding Cassandra. If you have any thoughts on which database is better suited for this, please share.)

Suppose that I need to version records in a simple address book. (Address book records are stored as flat JSON objects.) I expect that the history:

  • will be used infrequently
  • will be used all at once to present it in a "time machine" fashion
  • will hold no more than a few hundred versions per record, and will never expire

I'm considering the following approaches:

  • Create a new object collection to store the history of records or the changes to the records. It would store one object per version with a reference to the address book entry. Such records would look as follows:

    {
     '_id': 'new id',
     'user': user_id,
     'timestamp': timestamp,
     'address_book_id': 'id of the address book record',
     'old_record': {'first_name': 'Jon', 'last_name':'Doe' ...}
    }
    

    This approach can be modified to store an array of versions per document, but that seems to be a slower approach without any advantages.

  • Store versions as serialized (JSON) objects attached to the address book entries. I'm not sure how to attach such objects to MongoDB documents, perhaps as an array of strings. (Modelled after Simple Document Versioning with CouchDB.)

Piotr Czapla
  • I want to know if this has changed since the question was answered? I don't know much about oplog but was this around at the time, would it make a difference? – Randy L Jul 31 '14 at 23:40
  • My approach is to think of all data as a time series. –  Dec 03 '15 at 18:40
  • MongoDB blog has described a simple approach: [Building with Patterns: The Document Versioning Pattern](https://www.mongodb.com/blog/post/building-with-patterns-the-document-versioning-pattern) – Brent Bradburn Nov 07 '22 at 21:59

9 Answers

The first big question when diving into this is "how do you want to store changesets?"

  1. Diffs?
  2. Whole record copies?

My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.

I would use a separate collection to save memory. You generally don't want the full history for a simple query, so keeping the history out of the object also keeps it out of commonly accessed memory when that data is queried.

To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:

{
    _id : "id of address book record",
    changes : { 
                1234567 : { "city" : "Omaha", "state" : "Nebraska" },
                1234568 : { "city" : "Kansas City", "state" : "Missouri" }
               }
}

To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
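As a rough illustration of the diff-and-save step, here is a minimal sketch in plain JavaScript. The helper names `diffFlat` and `buildHistoryUpsert` are hypothetical, and the sketch assumes flat records with primitive values, as in the question. It only builds the arguments for the history upsert; it does not talk to a database.

```javascript
// Sketch only: diffFlat and buildHistoryUpsert are hypothetical helper names.

// Return the fields of `next` whose values differ from `prev`.
// Assumes flat records with primitive values; removed fields are not tracked.
function diffFlat(prev, next) {
  const changes = {};
  for (const key of Object.keys(next)) {
    if (prev[key] !== next[key]) {
      changes[key] = next[key];
    }
  }
  return changes;
}

// Build the arguments for an upsert into the history collection.
// The dotted key targets one entry of `changes`, so earlier entries stay untouched.
function buildHistoryUpsert(recordId, timestamp, changes) {
  return {
    filter: { _id: recordId },
    update: { $set: { ["changes." + timestamp]: changes } },
    options: { upsert: true },
  };
}

const before = { first_name: "Jon", city: "Omaha", state: "Nebraska" };
const after = { first_name: "Jon", city: "Kansas City", state: "Missouri" };

const spec = buildHistoryUpsert("abc123", 1234568, diffFlat(before, after));
console.log(JSON.stringify(spec.update));
```

A wrapper's save() method could then pass `spec.filter`, `spec.update` and `spec.options` to the driver's update call for the history collection, right after updating the main record.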

UPDATE: 2015-10

It looks like there is now a spec for handling JSON diffs: JSON Patch (RFC 6902). This seems like a more robust way to store the diffs / changes.
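Assuming the spec in question is JSON Patch (RFC 6902), as the comments on this answer suggest, a change between two flat records could be expressed as a list of operations. The sketch below is hand-rolled for illustration; `toJsonPatch` is a hypothetical helper, not part of any library, and it only covers the `add`, `replace` and `remove` operations on top-level fields.

```javascript
// Sketch only: toJsonPatch is a hypothetical helper for flat records.

// JSON Pointer (RFC 6901) requires "~" and "/" in keys to be escaped.
function escapePointer(key) {
  return key.replace(/~/g, "~0").replace(/\//g, "~1");
}

function hasOwn(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Produce RFC 6902 style operations that turn `prev` into `next`.
function toJsonPatch(prev, next) {
  const ops = [];
  for (const key of Object.keys(next)) {
    const path = "/" + escapePointer(key);
    if (!hasOwn(prev, key)) {
      ops.push({ op: "add", path, value: next[key] });
    } else if (prev[key] !== next[key]) {
      ops.push({ op: "replace", path, value: next[key] });
    }
  }
  for (const key of Object.keys(prev)) {
    if (!hasOwn(next, key)) {
      ops.push({ op: "remove", path: "/" + escapePointer(key) });
    }
  }
  return ops;
}

const patch = toJsonPatch(
  { city: "Omaha", state: "Nebraska", fax: "555-0100" },
  { city: "Kansas City", state: "Missouri", email: "jon@example.com" }
);
console.log(JSON.stringify(patch, null, 2));
```

For nested documents or arrays, an existing implementation of the RFC is likely a better choice than extending this by hand.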

Gates VP
  • Wouldn't you worry that such a history document (the changes object) will grow over time and updates will become inefficient? Or does MongoDB handle document growth easily? – Piotr Czapla Nov 16 '10 at 07:33
  • Take a look at the edit. Adding to `changes` is really easy: `db.hist.update({_id: ID}, {$set: {"changes.12345": CHANGES}}, true)`. This will perform an upsert that will only change the required data. Mongo creates documents with "buffer space" to handle this type of change. It also watches how documents in a collection change and modifies the buffer size for each collection. So MongoDB is designed for exactly this type of change (add new property / push to array). – Gates VP Nov 17 '10 at 05:59
  • I've done some testing and indeed the space reservation works pretty well. I wasn't able to catch the performance loss when the records were reallocated to the end of the data file. – Piotr Czapla Nov 27 '10 at 10:03
  • @GatesVP : How would you go about storing the ID of the user that made the change in this scheme? – UpTheCreek Sep 21 '11 at 13:07
  • I would just add it to the `changes.1234567` object. Given that the structure is flexible in MongoDB, this should be easy to pull out. – Gates VP Sep 21 '11 at 19:09
  • minor elaboration, when you query documents in mongo you can specify fields to include or exclude: http://www.mongodb.org/display/DOCS/Querying#Querying-FieldSelection So memory concerns alone may not force you to use a separate collection. Or at least, the memory concerns would only be on the mongod side, not the application side. That said I'd probably use a separate collection anyway just to keep things tidy, so not really arguing with the answer. – Havoc P Dec 02 '11 at 05:18
  • Specifying fields can save you from sending data across the wire, however, it doesn't make the actual document any smaller. When you query MongoDB, even with the specifier it will load the entire document into memory and then send you only the bits you need. You are correct, memory constraints will be on the `mongod` side, but it's easy to see how small documents with lots of changes can muck with the memory profile :) – Gates VP Dec 02 '11 at 23:55
  • When you write the changes to the separate changes collection, do you do a diff on the old data to get only the changes to write, or do you know of any way to do that? My app doesn't diff the new data and often has to save the same data over again, because things like forms send all data together. – Paul Shapiro Dec 19 '11 at 03:32
  • Also, isn't using "_id" really bad because it would conflict with the existing object's _id? – Paul Shapiro Dec 19 '11 at 20:05
  • The point with `_id` is to use the unique ID in the `hist` collection. Because the ID is unique in the original, it will also be unique in `hist`. I think diffs are really the way to go. Most "ORM" frameworks already know which fields have changed and only update those fields. MongoDB has a series of `$set` commands for modifying only changed fields. With this method, you're basically applying this twice: once to the original and once to `'changes.1234'`. – Gates VP Dec 20 '11 at 01:54
  • You can use https://github.com/mirek/node-rus-diff to generate (MongoDB compatible) diffs for your history. – Mirek Rusin Mar 11 '14 at 18:05
  • The [JSON Patch RFC](https://tools.ietf.org/html/rfc6902) provides a way to express diffs. It has [implementations in several languages](http://trac.tools.ietf.org/wg/appsawg/trac/wiki/JsonPatch). – Jérôme Oct 26 '15 at 16:25
  • @Jérôme: thanks, I'll add this as a note to the main question. Obviously this answer is older than even the JSON Patch spec, so it's definitely an "update". – Gates VP Oct 27 '15 at 18:14
  • Thank you guys for the reference to RFC 6902, here's a node implementation of it: https://www.npmjs.com/package/fast-json-patch – Doug Molineux Dec 12 '15 at 19:06
  • So what would a query look like that asks for an object at a given old version? Isn't the ability to retrieve old versions the purpose of keeping history? – Sergey Shcherbakov Sep 14 '16 at 22:30
  • @SergeyShcherbakov if you're doing diffs, then there is no reasonable query. You have to grab the current record, grab the list of changes and then programmatically work back in time. On a one-by-one basis, this is a reasonable way to audit a system. If you need full snapshots of several objects at a given point in time, then this methodology is completely insufficient. Frankly, MongoDB may be insufficient for such a use case. – Gates VP Sep 15 '16 at 18:14
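
The "work back in time" step from the last comment could look roughly like the sketch below. It rests on an assumption the answer does not state: each entry in `changes` holds the values the fields had before the change at that timestamp (similar to `old_record` in the question). `recordAsOf` is a hypothetical helper, and fields that were newly added by a change are not handled.

```javascript
// Sketch only. Assumes each entry in `changes` stores the PREVIOUS values of the
// fields modified at that timestamp (an assumption, not stated in the answer).
function recordAsOf(current, changes, asOf) {
  const state = { ...current };
  // Undo every change made after `asOf`, newest first.
  const newerFirst = Object.keys(changes)
    .map(Number)
    .filter((ts) => ts > asOf)
    .sort((a, b) => b - a);
  for (const ts of newerFirst) {
    Object.assign(state, changes[ts]);
  }
  return state;
}

const current = { name: "Jon Doe", city: "Kansas City", state: "Missouri" };
const changes = {
  1234567: { city: "Lincoln" },                   // city before the first edit
  1234568: { city: "Omaha", state: "Nebraska" },  // values before the second edit
};

console.log(recordAsOf(current, changes, 1234567)); // state right after the first edit
console.log(recordAsOf(current, changes, 0));       // original state
```

This runs in the application after fetching both the current record and its history document, which is why it suits one-off audits better than bulk point-in-time queries.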

There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.

One of these issues is concurrent updates, another one is deleting documents.

Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.

https://github.com/thiloplanz/v7files/wiki/Vermongo

David Pfeffer
Marian
  • How do you actually use it? – hadees Dec 18 '12 at 18:49
  • There is no documentation on how this project is actually used. Is it something that lives within Mongo somehow? Is it a Java library? Is it merely a way of thinking about the problem? No idea and no hints are given. – ftrotter Feb 06 '13 at 06:41
  • This is actually a Java app and the relevant code lives here: https://github.com/thiloplanz/v7files/blob/master/src/main/java/v7db/files/mongodb/Vermongo.java – ftrotter Feb 06 '13 at 07:48

Here's another solution using a single document for the current version and all old versions:

{
    _id: ObjectId("..."),
    data: [
        { vid: 1, content: "foo" },
        { vid: 2, content: "bar" }
    ]
}

data contains all versions. The data array is ordered, new versions will only get $pushed to the end of the array. data.vid is the version id, which is an incrementing number.

Get the most recent version:

find(
    { "_id":ObjectId("...") },
    { "data":{ $slice:-1 } }
)

Get a specific version by vid:

find(
    { "_id":ObjectId("...") },
    { "data":{ $elemMatch:{ "vid":1 } } }
)

Return only specified fields:

find(
    { "_id":ObjectId("...") },
    { "data":{ $elemMatch:{ "vid":1 } }, "data.content":1 }
)

Insert new version: (and prevent concurrent insert/update)

update(
    {
        "_id":ObjectId("..."),
        $and:[
            { "data.vid":{ $not:{ $gt:2 } } },
            { "data.vid":2 }
        ]
    },
    { $push:{ "data":{ "vid":3, "content":"baz" } } }
)

2 is the vid of the current most recent version and 3 is the new version getting inserted. Because you need the most recent version's vid anyway, it's easy to get the next version's vid: nextVID = oldVID + 1.

The $and condition ensures that 2 is the latest vid.

This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert.
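
To make the effect of that guard concrete, here is a small in-memory sketch in plain JavaScript. `tryPushVersion` is a hypothetical stand-in for the guarded update: it applies the same two conditions (the expected vid exists, and no vid is greater) before appending. With a real server the check and the $push happen atomically inside one update; this sketch only mimics the outcome.

```javascript
// Sketch only: an in-memory stand-in for the guarded $push shown above.
// tryPushVersion is a hypothetical helper, not a driver or shell function.
function tryPushVersion(doc, expectedVid, content) {
  const vids = doc.data.map((v) => v.vid);
  const isLatest =
    vids.includes(expectedVid) && !vids.some((vid) => vid > expectedVid);
  if (!isLatest) {
    return false; // a newer version already exists, so nothing is written
  }
  doc.data.push({ vid: expectedVid + 1, content });
  return true;
}

const doc = {
  data: [
    { vid: 1, content: "foo" },
    { vid: 2, content: "bar" },
  ],
};

// Two writers both read vid 2 as the latest version.
const first = tryPushVersion(doc, 2, "baz");  // succeeds and creates vid 3
const second = tryPushVersion(doc, 2, "qux"); // rejected: vid 3 now exists
console.log(first, second, doc.data.length);
```

In an application, the rejected writer would typically re-read the document and retry with the new latest vid, or report a conflict to the user.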

Remove a specific version:

update(
    { "_id":ObjectId("...") },
    { $pull:{ "data":{ "vid":2 } } }
)

That's it!

(remember the 16MB per document limit)

Benjamin M
  • With MMAPv1 storage, every time a new version is added to data, there is a possibility that the document will be moved. – raok1997 Jan 07 '16 at 15:30
  • Yes, that's right. But if you just add new versions every once in a while, this should be negligible. – Benjamin M Jan 07 '16 at 15:44

If you're looking for a ready-to-roll solution -

Mongoid has built-in simple versioning

http://mongoid.org/en/mongoid/docs/extras.html#versioning

mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo

https://github.com/aq1018/mongoid-history

s01ipsist

I worked through this solution that accommodates published, draft and historical versions of the data:

{
  published: {},
  draft: {},
  history: {
    "1" : {
      metadata: <value>,
      document: {}
    },
    ...
  }
}

I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/

For those that may implement something like this in Java, here's an example:

http://software.danielwatrous.com/using-java-to-work-with-versioned-data/

All the code is available to fork, if you like:

https://github.com/dwatrous/mongodb-revision-objects

Daniel Watrous

If you are using mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format

mongoose-patch-history

bmw15

Another option is to use the mongoose-history plugin.

let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;

let MySchema = new Schema({
    title: String,
    status: Boolean
});

MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection with the schema name + "_history".
// In this case, collection with name "my_schema_history" will be created.
Muhammad Reda

I have used the package below for a Meteor/MongoDB project, and it works well. The main advantage is that it stores history/revisions in an array within the same document, so there is no need for additional publications or middleware to access the change history. It can keep a limited number of previous versions (e.g. the last ten), and it also supports change concatenation (so all changes that happened within a specific period are covered by one revision).

nicklozon/meteor-collection-revisions

Another sound option is to use Meteor Vermongo (here)

helcode

You can try JaVers (https://javers.org/). I haven't found a better solution so far.

Piotr Żak