
Generally speaking, I want to know the best practices for querying (and therefore indexing) schemaless data structures (i.e. documents).

Let's say I use MongoDB to store and query deterministic data structures in a collection. At this point all documents have the same structure, so I can easily create indexes for any query in my app, since I know each document has the required field(s) for the index.

What happens after I change the structure and try to save new documents to the db? Let's say I joined two fields, FirstName and Lastname, into FullName. As a result the collection contains nondeterministic data. I see two problems here:

  • Old indexes cannot cover the new data, so new indexes are needed that handle both the old and new fields
  • The app has to deal with two representations of the documents
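The second problem, in application code, amounts to a read path that understands both shapes. A minimal sketch in Python over dict-shaped documents (the field names follow the example above and are otherwise hypothetical):

```python
def full_name(doc):
    """Return a person's full name from either schema version.

    Handles both the old shape ('FirstName' + 'Lastname') and the
    new shape ('FullName').
    """
    if "FullName" in doc:
        return doc["FullName"]
    return f"{doc['FirstName']} {doc['Lastname']}"
```

Every read site needs this branch (or a shared helper like it) for as long as both versions coexist in the collection.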

This can become a big problem when there are many changes to the db, resulting in many versions of the document structure.

I see two main approaches:

  • Lazy migration. Each document is migrated on demand (i.e. only after being loaded from the collection) to the final structure and then stored back to the collection. This approach does not actually solve the problem, because it allows nondeterminism at any point in time.
  • Forced migration. The same approach as for RDBMS migrations: the migration is performed for all documents at one point in time while the app is not running. The main con is app downtime.
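Both approaches can be built around the same idempotent upgrade function; the only difference is when it runs. A sketch in Python, using the hypothetical field names from the example above:

```python
def migrate(doc):
    """Upgrade one document to the latest schema version; idempotent.

    In a lazy migration this runs after every load and the result is
    written back; in a forced migration it is applied to every document
    in one batch run.
    """
    if "FullName" not in doc:
        doc = dict(doc)  # don't mutate the caller's copy
        doc["FullName"] = f"{doc.pop('FirstName')} {doc.pop('Lastname')}"
    return doc
```

Making the function idempotent matters for lazy migration, since an already-migrated document may be loaded and passed through it again.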

So the question: is there any good way of solving this problem, especially without app downtime?

Philipp
neleus
  • Every NoSQL database is different and provides different tools and techniques for solving this problem. Can you please narrow your question down to a specific database? – Philipp Jun 27 '14 at 16:54
  • yes, there are two databases I look into: mongodb and couchbase – neleus Jun 27 '14 at 16:56
  • Can you please narrow your question down to **one** specific database? – Philipp Jun 27 '14 at 17:01
  • I would like to examine both, but let's start with mongo – neleus Jun 27 '14 at 17:05
  • A common way to handle a change like this on a live system is to support both the old and new schema during the transition. So for your name example, update your collection to have first, last and full_name fields. Once all documents have full_name you can change your application to start reading it, stop writing first & last, and then finally remove first and last. – James Wahlin Jun 27 '14 at 17:15

1 Answer


If you can't have downtime then the only choice is to do the migrations "on the fly":

  1. Change the application so that when new documents are saved the new field is created, but reads still use the old fields.
  2. Update your collection with a script/queries to add the new field in the collection.
  3. Create new indexes on that field.
  4. Change the application so that it reads from the new fields.
  5. Drop the unnecessary indexes and remove the old fields from the documents.

Changing the schema on a live database is never an easy process, no matter what database you use. It always requires some forward thinking and careful planning.

> is indexing a pain?

Indexing is not a pain, but premature optimization is. You should always test and check that you actually need indexes before adding them and when you have them, check that they are being properly used.

If you're worried about performance issues on a live system when creating indexes, then you should consider having replica sets and doing rolling maintenance (in short: taking secondaries down from replication, creating indexes on them, bringing them back into replication and then repeating the process for all the subsequent replica set members).

Edit

What I was describing is basically a process of migrating your schema to a new one while temporarily supporting both versions of the documents.

In step 1, you're basically adding support for multiple versions of documents. You're updating existing documents, i.e. creating the new fields, while still reading data from the previous version's fields. Step 2 is optional, because you could instead gradually update your documents as they are being saved.

In step 4 you're removing the support for the previous versions from your application code and migrating to a new version. Finally, in step 5 you're removing the previous version fields from your actual MongoDB documents.
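In other words, during the transition window every write produces a document that satisfies both schema versions, so old and new readers can coexist. A minimal sketch of the version check a reader might do in that window (Python, hypothetical field names):

```python
def schema_version(doc):
    """Classify a document during the transition window.

    'both'     -> written during the transition (steps 1-3)
    'new-only' -> written or cleaned up after step 5
    'old-only' -> not yet touched by the backfill or a save
    """
    has_old = "FirstName" in doc and "Lastname" in doc
    has_new = "FullName" in doc
    if has_old and has_new:
        return "both"
    return "new-only" if has_new else "old-only"
```

When a collection scan reports no remaining "old-only" documents, it is safe to proceed to steps 4 and 5.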

Christian P
    I found similar question on SO (http://stackoverflow.com/a/9326435/2528649) and it seems that I'll go the same way – neleus Jun 30 '14 at 14:12
  • Assuming I know exactly what indexes I need and when the system is so large that the downtime is not an option I see the most convenient way is supporting two versions of the data structures: _previous_ and _actual_. On the other hand all versions prior to _previous_ are not supported, this gives relatively simple indexes (and queries) but also imposes some restrictions: the data cannot be migrated to the next version until data in _previous_ version exist. – neleus Jun 30 '14 at 14:28
  • @neleus see my update for some clarifications about my answer. – Christian P Jun 30 '14 at 14:47
  • Yeah, the clarifications helped. Regarding the step 2, updating gradually may last forever so this step is also vital. – neleus Jun 30 '14 at 15:23
  • @neleus of course, for some use cases e.g. where documents are not updated regularly, you would not use gradual update. But I added that step 2 is optional, because there can be a case where you actually can use gradual updates (e.g. all of your documents are updated regularly so you don't need to worry about manual updating). – Christian P Jun 30 '14 at 16:31