37

In our current development workflow we have introduced database migrations (using Ruckusing) to keep our developers' db schemas in sync. It works great and is pretty straightforward to use, but now that we have switched to git as our VCS we are facing the next problem in our database versioning system.

When checking out a branch that has been in development for some time, it might happen that the database schema has diverged a lot from the schema in the branch I'm coming from. This causes database conflicts in some cases. Logically it seems that we need to run migrations depending on the branch we were on previously, but that can get complex really fast and is sure to cause problems for some people. And as far as I know there isn't a db migration system that is branch-aware?

Added complexity comes when switching to a feature branch: we might need to run some migrations up while running others down. Technically this seems impossible using our current db migration scripts. Are there any sane alternatives? Are there any preferred ways of working with database migrations in a very active and branched development system?

ChrisR
  • This is what I would call Pandora's box. It's massively complex and needs a lot of rules and education for everybody working on the complete codebase. This includes rules to prioritize merging of branches containing db changes etc... – NDM Oct 06 '14 at 09:49

5 Answers

29

I wouldn't really agree with incremental migrations being rotten. Having a set of homegrown scripts would, in my opinion, be a worse approach; a real tool for such a job will make tracking those changes easier. I've had to deal with a similar situation myself before, so hopefully I can share some of the insights.

In my experience, RDBMS schemas and branches don't mix very well. Depending on your branching, the schemas should probably be at least somewhat similar, in which case the migrations should not differ too much. Or I might just have misunderstood the full extent of the problem. If you're e.g. trying to keep customer-specific code on a branch, then maybe you should consider a way to modularize it instead. We did something like this, with rules stating that customer-specific schema changes and code could only ever depend on the common code base, not the other way around. We also set the precedence between module changesets based on module and date, so for the most part we knew the order in which the changes were to be applied. YMMV, of course, and it's hard to give specifics without knowing your current setup.

At my old company we successfully used a tool called Liquibase, which sounds similar to what you're using. Basically it is a tool for taking a DB schema, and all the data, from one known state to another known state. The same changeset is applied only once, since Liquibase maintains a changelog with checksums. The changelogs are written in a specific XML format. I can strongly recommend trying it out if you need alternatives.
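
For illustration, applying a changelog from the Liquibase command line might look roughly like this (the changelog path, JDBC URL, and credentials are assumptions, not details from the answer):

```
# "update" applies only the changesets not yet recorded in Liquibase's
# own changelog table; already-applied changesets (tracked by
# id/author/checksum) are skipped.
liquibase \
  --changeLogFile=db/changelog-master.xml \
  --url=jdbc:mysql://localhost/app \
  --username=dev --password=dev \
  update
```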

Anyway, the way we handled customer code and branches was to have a specific DB/schema for a given branch. This way you could have the schema and data from the branching point, and only migrate the diff to the current situation. We did not undo changes, even though Liquibase in theory supports this, as we felt it was way too cumbersome and error-prone. Given that Liquibase keeps its own state, the migration was always as easy as taking the current state on a given branch and applying all changesets. Only new changesets were applied, leaving the schema in a good state.

We used Mercurial, which is distributed like git, so the setup was quite similar. We also had developer-specific local DBs on the dev laptops, and a number of environments, both for different customers and for different phases (development, integration, production), so the model was put to a real test, and it worked surprisingly well. We had some conflicts in the changesets, but we were mostly able to resolve those soon after the problem was introduced. The local development environments were really the hardest part, since during development some schema changes might be introduced that were not always compatible with later changesets, but the structured nature of the changes, and having a known state to revert to, led to very few real problems.

There are a few caveats with this approach:

  1. All and any changes to the schema must be implemented in the changesets. The biggest cause of confusion was always someone just fiddling around a bit.
  2. The first point applies even if you're using a tool that modifies the schema, e.g. an ORM tool like Hibernate. You need to be pretty intimate with that tool to understand the changes it makes and requires.
  3. All users must buy into this, and be educated to follow the rules. See point 1.
  4. There comes a point when migrating lots of changesets starts taking too much time. At this time you will need to create a new baseline, which can be a bit tricky, especially with a lot of branches. It's good to plan ahead for this as well, and at least know of all existing DB-branches.
  5. You need to plan ahead a bit with the branches, to know whether they're going to migrate back to master at some point. Naive merging might not work well for schema changes.
  6. For very long-lived branches and separated datasets this model might not be strong enough.

The point is, however, that the more structure and control you have over the database, the easier migrations will be. Tools like Liquibase can therefore be a really valuable asset to help you track those changes. This applies to complex models to an even greater extent than to simple ones, so please at least don't consider dumping the tools you already have in place. And take some time to explore alternative tools.

Some structure and control is better than none, or, even worse, thinking you are in control with a big bunch of manual scripts.

Vadim Kotov
Kai Inkinen
  • You say you had "a specific DB/schema for a given branch" - do you mean you had some sort of reference copy of the database somewhere, which you could copy to a developer's database? Or every developer had a separate schema for each branch? Or something else? – Tom Anderson Jun 30 '11 at 09:41
  • I'm not sure why you think having a big bunch of scripts is not being in control. If your team wrote and maintains those scripts, you're in *complete* control. You're not trusting some third-party tool to solve a complicated problem correctly; you're directly in charge of what's being done. – Tom Anderson Jun 30 '11 at 09:43
16

I think the whole idea of incremental migrations is pretty rotten, really. In a complex environment like yours, it really doesn't work. You could make it work for simple branch patterns, but for anything complicated, it will be a nightmare.

The system I'm working with now takes a different approach: we have no ability to make incremental migrations, only to rebuild the database from a baseline. During initial development, that baseline was an empty database; during maintenance, it's a copy of the live database (restored from a dump). We just have a pile of SQL and XML scripts that we apply to the baseline to get a current system (migrations, essentially, but not designed to be run incrementally). Updating or switching branches is then very simple: nuke the database, load a dump to establish the baseline, run the scripts.
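
As a rough sketch of that cycle (the database name "app", the dump path, and the scripts directory are invented for illustration, assuming MySQL):

```
#!/bin/sh
set -e
mysql -e 'DROP DATABASE IF EXISTS app; CREATE DATABASE app'   # nuke the database
mysql app < baseline/app-baseline.sql                         # restore the baseline dump
for script in db-scripts/*.sql; do                            # apply every script, in order
    mysql app < "$script"
done
```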

This process is not as quick as just running a few migrations, but it's quick enough. It takes long enough that you can go and get a cup of coffee, but not long enough to get lunch.

The huge advantage is that starting by nuking the database means the process is completely history-independent, so it doesn't need to know or care about crossing branches, going back in time, or anything else.

When you take a release live, you obviously do things slightly differently: you don't nuke the database or load a dump, because the system is already at the baseline (the baseline is defined as the state of the live system!). You just run the scripts. And after that, make a fresh dump to be used as a new baseline for development.
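
Under the same assumed names, the release variant skips the nuke-and-restore and ends by cutting the next baseline:

```
# The live system is already at the baseline, so only the scripts run;
# afterwards a fresh dump becomes the new development baseline.
for script in db-scripts/*.sql; do
    mysql app < "$script"
done
mysqldump app > baseline/app-baseline.sql
```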

Tom Anderson
  • Is this a tried and proven technique or something you and your colleagues have just grown into? I'd like to do some more reading and research on it but don't know where to look or what to search for :) – ChrisR Jun 20 '11 at 11:15
  • It's our breakthrough discovery, which I'm sharing with you for a low, low price. Actually, I have no idea if this approach is widespread outside my company (it's not even universal within my company!). It's not something I've come across in discussions about migrations etc. That would make a good question for [Programmers](http://programmers.stackexchange.com/), actually. – Tom Anderson Jun 20 '11 at 11:34
  • Mind you, two people liked this idea enough to upvote it, so perhaps they use this approach too? We'll probably never know. – Tom Anderson Jun 20 '11 at 11:35
  • How is running 'a pile of SQL and XML scripts' against the DB different from incremental migrations? Do you mean that the scripts do not depend upon each other? It sounds essentially the same as an incremental migration system (start from baseline, apply scripts in order). – Luke H Jun 25 '14 at 23:08
  • I've given this some thought, and I think that, if I was using git, I'd be reluctant to adopt anything that made branch changes take longer than a few seconds. Would be interested to hear if @TomAnderson and his team ever found this to be a problem. – Alex Jan 31 '17 at 21:33
  • @Alex Oh yeah, using this approach to change branches took freaking forever! Initially "long enough that you can go and get a cup of coffee", but later in the project, >10 minutes, because of the amount of test data. So we added a caching layer: after a migration, the tool saves a dump on the shared database server; before a migration, it checks if there's a dump of the target state, and if there is, just loads it. That took the common case back to seconds. – Tom Anderson Feb 04 '17 at 15:00
  • That said, the slowness of the migrations was because we were loading a lot of test data using a very slow ETL tool which went via our ORM. If the test data had been SQL scripts, the rebuild would still have been really fast, and we wouldn't have needed the caching layer. – Tom Anderson Feb 04 '17 at 15:01
  • I take it you weren't doing feature branches? We have several devs and they're all working on small feature branches, which when complete, are merged back into master. We can't afford, money or time-wise, to have a separate db for each branch. – Tim Gautier May 22 '18 at 15:29
  • @TimGautier We weren't doing feature branches, but I don't see how it would be different. We did have separate development and release branches (the latter for urgent bugfixes etc), and it worked fine. Why do you think having a separate database for each branch would cost more time and money? I suspect we are starting with different assumptions here. – Tom Anderson May 29 '18 at 11:16
  • @TomAnderson Well, feature branches mean we have a lot of branches (10-20 at a time) and are creating new ones all the time (typically several a day). Our database is a couple terabytes and spinning up a new one isn't free or instant. We'd be spending all our time waiting for DBs to come up and the cost would be crazy. I've done the DB per branch thing and it works great, but only if you have a small DB or can tolerate very stale data. – Tim Gautier May 29 '18 at 16:58
  • @TimGautier Oh wow, if you have a >1 TB database then my method is indeed not a useful one at all! You might be able to use a similar approach by having a pool of prepared databases, depending on how much the database varies across branches. Or having a pool of databases at the baseline, if the migrations from there to the desired state are small. Either way, you would want a background job churning out fresh baseline databases to keep the pool full. – Tom Anderson May 30 '18 at 10:49
3

I'm in a similar situation where I work on a live website and several development branches in which I need to change the database schema.

I solved it by writing a post-checkout and a post-merge hook that can be used nicely with git. I store all my migrations in the form of SQL files in a separate directory and commit them alongside the changed PHP code. Each time I perform a `git checkout` or a `git merge`, git will automatically call the appropriate up- and down-migrations. See my implementation on GitHub.

As a special request (for those of you who don't want to follow the GitHub link), some more explanation:

Consider the following scenario. You have two branches:

  • master - which contains the website that's currently online
  • feature - which contains an unfinished new feature

For the new feature to work properly, it needs to change the database schema. The workflow is as follows:

  1. When, in your feature branch, you change code that needs a change to the database schema, you also commit two new SQL files to the migrations directory, say:

    • 20151120130200-extra-field-up.sql (containing all the SQL queries to migrate upwards)
    • 20151120130200-extra-field-down.sql (containing all the SQL queries to migrate downwards)
  2. When you now perform a checkout to master, the post-checkout git hook (sketched after this list) will:
    1. find all *-down.sql scripts in the commits from <new HEAD>..<old HEAD>
    2. execute those scripts with the local database
    3. find all *-up.sql scripts in the commits from <old HEAD>..<new HEAD>
    4. execute those scripts with the local database
  3. When you merge your feature branch into master, the post-merge hook will:
    1. find all *-up.sql scripts in the commits from master..feature
    2. execute those scripts with the local database
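
A minimal sketch of what such a post-checkout hook could look like (illustrative only, not the actual implementation from the GitHub link; the migrations/ directory and the local MySQL database name "app" are assumptions):

```
#!/bin/sh
# post-checkout: $1 = previous HEAD, $2 = new HEAD, $3 = 1 for branch checkouts
OLD_HEAD=$1; NEW_HEAD=$2
[ "$3" = "1" ] || exit 0

apply() { mysql app; }   # pipe a migration script into the local database

# Steps 1+2: down-migrations from the commits we are leaving, newest first.
# Those files are no longer in the working tree, so read them out of git.
git log --pretty=format: --name-only "$NEW_HEAD..$OLD_HEAD" \
  | grep '^migrations/.*-down\.sql$' \
  | while read -r f; do git show "$OLD_HEAD:$f" | apply; done

# Steps 3+4: up-migrations from the commits we are entering, oldest first.
git log --reverse --pretty=format: --name-only "$OLD_HEAD..$NEW_HEAD" \
  | grep '^migrations/.*-up\.sql$' \
  | while read -r f; do git show "$NEW_HEAD:$f" | apply; done
```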

Install

Simply copy the post-checkout and/or post-merge file to the .git/hooks directory of your own git repository. You can edit the config section of those files. See the files themselves for an explanation.

Usage

The naming of the migration SQL files is crucial. They should end with up.sql or down.sql. The rest of the name is completely up to you. However, if you have a single commit with multiple up-migrations and/or multiple down-migrations, the order in which they are performed depends on lexicographical order. Migration files that are in different commits will always be called in the same (for down-migrations, reverse) order as the commits.

It's not a requirement that you have both an up-migration and a down-migration, nor is it a requirement that up- and down-migrations are named similarly.

  • You should add more details about how these hooks work and how the migration scripts should be written. (Don't only link to the code and explanations, because the link might break in the future.) – try-catch-finally Nov 21 '15 at 10:52
1

An approach I'm thinking of testing in our current project is to create a 'migrations' branch to which all (and only) migrations are committed. Developers must merge from this branch into their current branch before creating a migration, so that their migration is always created on top of the latest migration. All branches merge from this branch, so every branch has a concept of a linear migration history. This gives every branch the ability to move back and forth between database versions. When switching to a branch that depends on a different version of the database, the developer applies whichever migration is appropriate.

The annoyance (besides the extra work and diligence of committing migrations to the special branch) is remembering which migration corresponds to a particular branch. One way of doing this is, instead of committing migrations directly onto the migrations branch, to commit the migration (and only the migration) onto the current branch and then cherry-pick that commit onto the migrations branch. Then you can just look at the last time the current branch was cherry-picked onto the migrations branch and know that that diff contains the necessary migration. Alternatively, the developer might create a migration just to see what changes would be necessary, and then infer which existing migration would be appropriate to use.
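
A hypothetical version of that cherry-pick flow (branch and file names invented for illustration):

```
# On the feature branch, commit the migration as its own commit.
git checkout feature/extra-field
git add migrations/20151120-extra-field.sql
git commit -m 'migration: add extra field'

# Copy just that commit (here the branch tip) onto the migrations branch.
git checkout migrations
git cherry-pick feature/extra-field

# Keep the feature branch on top of the latest migration history.
git checkout feature/extra-field
git merge migrations
```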

Sorry for the vague suggestion; if we do end up trying this approach I'll edit this suggestion with more concrete recommendations.

Carl G
0

This is something I've been working on lately. For me the problem hasn't been that the database schemas have diverged per se, but rather that git can't merge them together. Feature branches which touch the database schema are always scary.

The solution I've been thinking about is, instead of having linear migrations, to have migrations that depend on other migrations. You get a nice dependency graph of your migrations which is easy enough to linearize (topological sort). Just keep track of the applied migrations by name in your database, and execute, in the correct order, the updates that haven't been applied yet.

For example, addCustomerSalt depends on initialSchema, and separateAddress depends on person.
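
For what it's worth, coreutils already ships a tool for exactly that linearization step. A small sketch (the initialSchema -> person edge is my assumption, added to connect the graph):

```
# Each input line is "<dependency> <migration>"; tsort prints one
# valid linear order of the whole dependency graph.
tsort <<'EOF'
initialSchema addCustomerSalt
initialSchema person
person separateAddress
EOF
# One possible output:
#   initialSchema
#   addCustomerSalt
#   person
#   separateAddress
```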

The one problem this does not solve is when branch A depends on update Z, which was created in branch B; but maybe in that case you should rebase to a common ancestor?

Masse