TL;DR: what's the best way to migrate a lot of data from one very poorly structured database (with lots of repeated columns, no relationships between tables, and duplicate data) to another, highly organised relational structure? - Sorry about the long read!
I've recently taken on a very complex job: rewriting an entire company's web-based IT platform. I'm afraid I can't give too many details, because we can't let the old developer know - he has a metaphorical gun to the company's head, in that he's the only one who knows how to do critical things like invoice generation, and he's demanding more and more money.
The major problem is that the entire web platform (used by all the staff and all the customers) was coded by a guy whose skills were less than amateur. It's made up of ~300 individual code files. There's no template library - everything is hardcoded into each file. There's no logical database structure - it was practically made up as he went along. There's no security - it's shocking. Anyway, we will be rewriting the whole platform over a ~3-month period.
However, the boss says that on the morning it goes live, no customer data can be lost anywhere: the entire contents of the database have to be copied over directly. The database structure is currently so poor that it's nearly impossible to work with, but this week we will be (attempting to!) write some scripts to migrate it over to our new, highly relational structure, which is far more logical. The question is, what's the best way to do this?
One example is addresses. In the old database, addresses are used in about 12 tables (of 44 in total...). In ours, we are having one `addresses` table which will be cross-referenced by other tables (e.g. via an `address_id` column) to keep things clean. The main problem is that in about half of his tables the addresses are stored as `line1`, `line2`, `town`, `city`, etc., which is fine, but in the other half he just has a single `address` field which stores the whole thing!
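To make that concrete, this is roughly the kind of helper we were imagining for addresses - just a sketch, assuming the column names above, with the splitting of the free-text field being pure guesswork that would need manual review:

```php
<?php
// Rough sketch only - handles both shapes of address data in the old schema.
// The output keys match our new `addresses` table; the splitting of the
// single free-text field is guesswork and will need manual checking.

function normaliseAddress(array $oldRow): array
{
    if (array_key_exists('line1', $oldRow)) {
        // Already split into components - just map the columns across.
        return [
            'line1' => trim($oldRow['line1']),
            'line2' => trim($oldRow['line2'] ?? ''),
            'town'  => trim($oldRow['town'] ?? ''),
            'city'  => trim($oldRow['city'] ?? ''),
        ];
    }

    // Single free-text `address` field: split on commas/newlines.
    $parts = array_values(array_filter(array_map('trim', preg_split('/[,\n]+/', $oldRow['address']))));

    if (count($parts) < 2) {
        // Nothing sensible to split - dump it all in line1 for manual review.
        return ['line1' => $oldRow['address'], 'line2' => '', 'town' => '', 'city' => ''];
    }

    return [
        'line1' => $parts[0],
        'line2' => implode(', ', array_slice($parts, 1, -2)),
        'town'  => count($parts) >= 3 ? $parts[count($parts) - 2] : '',
        'city'  => $parts[count($parts) - 1],
    ];
}
```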
A second example is dates: in some tables he has seconds-since-epoch dates, in others MySQL `NOW()` dates, and in others he literally stores them in six columns per row (`year`, `month`, `day`, `hour`, `minute`, `second`) - ouch...
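Dates would get the same treatment - a single function that accepts whichever of the three formats a table happens to use and spits out a MySQL DATETIME string. Again just a sketch; the `created` column name is an assumption:

```php
<?php
// Sketch of flattening the three date formats into one MySQL DATETIME string.
// The `created` column name and the year/month/day/... columns are examples
// of what appears in his tables, not a definitive list.

function normaliseDate(array $oldRow): string
{
    if (isset($oldRow['year'], $oldRow['month'], $oldRow['day'])) {
        // Format 3: six separate columns per row.
        $ts = mktime(
            (int) ($oldRow['hour'] ?? 0),
            (int) ($oldRow['minute'] ?? 0),
            (int) ($oldRow['second'] ?? 0),
            (int) $oldRow['month'],
            (int) $oldRow['day'],
            (int) $oldRow['year']
        );
        return date('Y-m-d H:i:s', $ts);
    }

    $value = $oldRow['created'];

    if (ctype_digit((string) $value)) {
        // Format 1: seconds since the Unix epoch.
        return date('Y-m-d H:i:s', (int) $value);
    }

    // Format 2: already a MySQL DATETIME (from NOW()) - re-parse it so that
    // anything malformed fails here rather than at insert time.
    return (new DateTime($value))->format('Y-m-d H:i:s');
}
```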
What's a good way to go about tackling this? Should we start from our tables and work out where in his database each piece of data needs to be pulled from, or should we reverse it and start from his tables and work out where each piece of his data needs to go in ours?
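For what it's worth, our current thinking is target-driven: a mapping keyed by our new tables, saying which of his tables each one pulls from and which transform to apply. Every name below is an invented placeholder, not his real schema:

```php
<?php
// A target-driven mapping: keyed by OUR new tables, each entry saying which
// of his tables feed it and how each row gets reshaped. All names invented.

$mapping = [
    'addresses' => [
        'sources'   => ['old_customers', 'old_orders', 'old_invoices'],
        'transform' => 'normaliseAddress',   // the address sketch above
    ],
    'customers' => [
        'sources'   => ['old_customers'],
        'transform' => 'normaliseCustomer',  // hypothetical, same idea
    ],
];
```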
From a programming standpoint, how should we tackle this? A lot of the data needs dynamic reformatting (e.g. the dates), so we were thinking of having our scripts pluck the data one row at a time, format it correctly, then insert it into the right places in the new database.
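Something along these lines is what we had in mind - a minimal sketch using PDO with one connection to each database. It leans on the `normaliseAddress()` and `normaliseDate()` sketches above, and every credential, table and column name is a placeholder:

```php
<?php
// Row-at-a-time migration sketch: read from the old DB, reformat, insert
// into the new DB. Relies on normaliseAddress() / normaliseDate() above.
// Credentials, table and column names are all placeholders.

$old = new PDO('mysql:host=localhost;dbname=old_db;charset=utf8mb4', 'user', 'pass');
$new = new PDO('mysql:host=localhost;dbname=new_db;charset=utf8mb4', 'user', 'pass');
$old->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$new->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$insertAddress = $new->prepare(
    'INSERT INTO addresses (line1, line2, town, city) VALUES (?, ?, ?, ?)'
);
$insertCustomer = $new->prepare(
    'INSERT INTO customers (name, email, address_id, created) VALUES (?, ?, ?, ?)'
);

// One big transaction: either the whole table migrates or none of it does.
$new->beginTransaction();

foreach ($old->query('SELECT * FROM old_customers') as $row) {
    $addr = normaliseAddress($row);
    $when = normaliseDate($row);

    // Insert the address first so the customer row can reference it by id.
    $insertAddress->execute([$addr['line1'], $addr['line2'], $addr['town'], $addr['city']]);
    $addressId = (int) $new->lastInsertId();

    $insertCustomer->execute([
        trim($row['cust_name']),
        strtolower(trim($row['cust_email'])),
        $addressId,
        $when,
    ]);
}

$new->commit();
```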
Speed and efficiency of queries are not an issue for us, as we will only need to run this once (after testing), on our local machines. His database is currently ~800MB when dumped to SQL, but a lot of that is useless test data or just totally unnecessary.
Any ideas on the best way to tackle this? For reference, our system will be rewritten in PHP, so any PHP-based recommendations would be nice. The database is currently (and will remain) in MySQL.