1

We are overhauling our product by moving completely from Microsoft and the .NET family to open source (the reasons include cost cutting and an exponential increase in data).

We plan to move our data model completely from SQL Server (relational data) to Hadoop (the well-known key-value pair ecosystem).

In the beginning, we want to support both versions (say v1.0 and the new v2.0). To maintain data consistency, we plan to sync the data between both systems. This is a fairly challenging and error-prone task, but we don't have any other option.

I am a bit confused about where to start, so I am looking to the community of experts. Any strategy, existing literature, or other guidance in this direction would be greatly helpful.

nhahtdh
Panks

2 Answers

1

I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class through which all your SQL is executed, you could override the save functions to write changes to both databases. If you do not have a data layer, you may want to consider writing one before starting the transition.
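As a rough illustration of the dual-write idea, here is a minimal Python sketch. Everything in it is hypothetical: `InMemoryStore` stands in for whatever SQL Server and Hadoop clients you actually use, and `DualWriteRepository` and `failed_mirrors` are names invented for this example. It assumes the existing database stays authoritative and that a failed write to the new system should not fail the user's request.

```python
class InMemoryStore:
    """Hypothetical stand-in for a real backend (SQL Server or Hadoop client)."""

    def __init__(self):
        self.rows = {}

    def save(self, key, record):
        self.rows[key] = dict(record)


class DualWriteRepository:
    """Writes to the primary store first, then mirrors the write to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        # Writes that reached the primary but not the secondary (assumed retry list).
        self.failed_mirrors = []

    def save(self, key, record):
        # The primary write is allowed to raise: the stable system stays authoritative.
        self.primary.save(key, record)
        try:
            self.secondary.save(key, record)
        except Exception:
            # Do not fail the caller; remember the write so it can be retried later.
            self.failed_mirrors.append((key, dict(record)))
```

The calling code keeps using a single `save`, so the second database can be removed later by swapping the repository back out.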

Otherwise, you could add triggers in MSSQL to update Hadoop, though I am not sure what you could do in Hadoop to keep MSSQL in sync.

Or, you could have a process that runs every x minutes and syncs the two databases.
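One way such a periodic process could work is with a high-water mark: each run copies only the rows changed since the previous run. The sketch below is an assumption-laden illustration, not a recipe for your schema. It uses Python's built-in `sqlite3` as a stand-in for SQL Server, a hypothetical `customers` table with an integer `version` column that increases on every change, and a plain dict as a stand-in for the Hadoop side.

```python
import sqlite3


def sync_changes(conn, target, last_version):
    """Copy rows changed since last_version into target.

    Returns the new high-water mark, which the caller should persist
    and pass in on the next run.
    """
    rows = conn.execute(
        "SELECT id, name, version FROM customers "
        "WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    for row_id, name, version in rows:
        target[row_id] = {"name": name}  # upsert into the target store
        last_version = version
    return last_version
```

A scheduler (cron, a Windows scheduled task, or similar) would call `sync_changes` every x minutes. Note that this sketch does not detect deleted rows; those would need a separate mechanism such as soft deletes.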

Personally, I would try to avoid maintaining two databases of record. Moving changes from a new, experimental database to your stable database seems risky: you risk corrupting your stable system. Instead, I would write a converter to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground and won't affect your live product. If you plan on making major changes to your UI and fear some users will not want to transition to 2.0, then you might be trying to tackle too much at once.
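The core of such a converter might look like the following sketch, which flattens one relational row into key-value pairs. The `table:id:column` key format is purely an assumption for illustration; the right layout depends on which Hadoop-side store you pick and how you query it. `row_to_key_values` is a hypothetical helper name.

```python
def row_to_key_values(table, primary_key, row):
    """Flatten one relational row (a dict) into (key, value) pairs.

    One pair is produced per non-key column; NULL (None) columns are
    skipped, since a key-value store can simply omit them.
    """
    row_id = row[primary_key]
    return [
        (f"{table}:{row_id}:{column}", value)
        for column, value in row.items()
        if column != primary_key and value is not None
    ]
```

A nightly job would read each table, pass every row through a function like this, and bulk-load the resulting pairs.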

Those are the solutions I came up with... Good luck!

Justin Fisher
  • You are right. Doing a bidirectional sync would mess up our stable database. We decided to keep it unidirectional and to keep an intermediate queue for the new version. If a user wants to experiment with the new beta, his data will eventually be synced to the stable DB. We have to accept the rare chance of data inconsistency in this case, but it's temporary. Thanks Justin! -Panks – Panks Jun 16 '11 at 17:07
  • Two-way synchronization seemed risky. We decided to keep a central in-memory cache that will be responsible for syncing the two data repositories (with some intermediate adapters/transformers) – Panks Jun 20 '11 at 06:56
  • Justin could you please also go through http://stackoverflow.com/questions/6408091/real-time-unidirectional-synchronization-from-sql-server-to-another-data-reposito – Panks Jun 20 '11 at 07:35
0

Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.

David Medinets