
I have a table that records its row insert/update timestamps in a column.

I want to synchronize the data in this table with another table on another DB server. The two DB servers are not connected, and synchronization is one-way (master/slave). Using table triggers is not suitable.

My workflow:

  • I use a global last_sync_date parameter and query the Master table for the changed/inserted records (a minimal sketch of this pull follows the list)
  • Output the resulting rows to XML
  • Parse the XML and update the Slave table using inserts and updates
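
A minimal JDBC sketch of that incremental pull (table and column names like `master_table` and `modified_at` are illustrative, not my real schema):

```java
import java.sql.*;

public class IncrementalPull {
    // Step 1: pull every row changed or inserted since the last sync.
    public static void pullChangedRows(Connection master, Timestamp lastSyncDate)
            throws SQLException {
        String sql = "SELECT * FROM master_table WHERE modified_at > ?";
        try (PreparedStatement ps = master.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSyncDate);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... serialize each row into the XML export ...
                }
            }
        }
    }
}
```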

The complexity of the problem rises when dealing with deleted records in the Master table. To catch the deleted records, I think I have to maintain a log table of the previously inserted records and use a SQL `NOT IN` query. This becomes a performance problem when dealing with large datasets.
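
The check I am worried about is essentially this (assuming the exported master IDs have been loaded into a staging table `master_ids_log` on the slave; names are illustrative):

```java
import java.sql.*;

public class DeleteCheck {
    // NOT IN over a large id list is exactly what gets slow on big datasets.
    public static int removeDeletedRows(Connection slave) throws SQLException {
        String sql = "DELETE FROM slave_table "
                   + "WHERE id NOT IN (SELECT id FROM master_ids_log)";
        try (Statement st = slave.createStatement()) {
            return st.executeUpdate(sql);
        }
    }
}
```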

What would be an alternative workflow dealing with this scenario?

Serkan Arıkuşu
  • And what database are you using? (master and slave?) – beny23 Mar 05 '13 at 11:32
  • Currently using MSSQL for the master and MySQL as the slave, so I do not think that a database-specific solution will help – Serkan Arıkuşu Mar 05 '13 at 11:34
  • Also take a look at this slightly related link http://en.wikipedia.org/wiki/CAP_theorem – Guido Mar 05 '13 at 12:19
  • The wording of the question doesn't fit its content and is suspect to me. If the servers aren't connected then your only option is an export file to a removable disk and transfer the data that way, otherwise they are connected somehow. Connected includes the web, so you have options like triggers available to you. The only reason you shouldn't be able to use them is a DBA denying it and I've never met a DBA who would deny the use of a trigger if it proved beneficial. Firewall restrictions and such can be overcome if you're doing this legitimately. – Just Aguy Mar 10 '13 at 02:40
  • @JustAguy I feel for you as I offered the solution of a message queue because he tagged the question [tag:spring-batch]. IMHO either he should use real replication or a message queue. He also worried about a performance problem that is not even a problem yet (i.e. *NOT IN* concerns). – Adam Gent Mar 14 '13 at 00:13
  • @AdamGent Awesome! I made a sarcastic comment 3 days ago regarding the machines not being "connected", over-exaggerating things. Please tell me what you've pieced together in your head that would invoke such a random comment. – Just Aguy Mar 14 '13 at 01:00
  • @JustAguy ... That whatever I recommend to this guy... he has bigger problems ;) – Adam Gent Mar 14 '13 at 01:02
  • @JustAguy have you ever read about the nonsense of putting triggers over DB links? What happens if the connection goes down and they get invalidated? This sync job, as clearly stated in the question, must be a process that will not harm the business. – Serkan Arıkuşu Mar 14 '13 at 04:14
  • @AdamGent The workflow described in the question is live; we are using it successfully. But since it is a legacy system, I opened this question. So the performance problem *is* a problem. – Serkan Arıkuşu Mar 14 '13 at 04:15
  • @SerkanArıkuşu I was just giving you crap. For the `NOT IN` you just have to make a temporary IDs table. See my answer: http://stackoverflow.com/questions/2861230/what-is-the-best-approach-using-jdbc-for-parameterizing-an-in-clause/11119642#11119642 and http://stackoverflow.com/a/12927312/318174 . The only issue is it seems like you don't want to do any DB changes... can you make a tmp IDs table? – Adam Gent Mar 14 '13 at 04:56
  • @AdamGent I am really happy with your proposal of using message queues. I mentioned a "log table" in the question; this is the tmp IDs table. – Serkan Arıkuşu Mar 14 '13 at 05:02
  • @AdamGent Where did I suggest using triggers for this solution? Oh, that's right, I didn't. Seriously man, give it up. – Just Aguy Mar 14 '13 at 14:07

10 Answers


It sounds like you need a transactional message queue.

How this works is simple. When you update the master DB, you send a message describing the update to the message broker, which can route it to any number of queues. Each slave DB can have its own queue, and because queues preserve order, the process should eventually synchronize correctly (ironically, this is roughly how most RDBMSs implement replication internally).

Think of the message queue as a sort of SCM change-list or patch-list database. That is, for the most part the same (or roughly the same) SQL statements sent to the master should eventually be replicated to the other databases. Don't worry about losing messages, as most message queues support durability and transactions.
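
A minimal sketch of the producer side, assuming a JMS broker such as ActiveMQ (the queue name and the XML payload format are illustrative):

```java
import javax.jms.*;

public class ChangePublisher {
    private final ConnectionFactory factory; // e.g. an ActiveMQConnectionFactory

    public ChangePublisher(ConnectionFactory factory) {
        this.factory = factory;
    }

    // Publish one master-side change; the slave's consumer applies them in order.
    public void publish(String changeXml) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            // Transacted session: the send is committed atomically with commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageProducer producer =
                session.createProducer(session.createQueue("slave-db-sync"));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT); // survive broker restarts
            producer.send(session.createTextMessage(changeXml));
            session.commit();
        } finally {
            connection.close();
        }
    }
}
```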

I recommend you look at Spring Integration, especially since you tagged this question with spring-batch.

Based on your comments:

BTW, your concern about NOT IN being a performance problem is not a very strong one, as there are a plethora of workarounds; but given that you don't want to do DB-specific things (like triggers and replication), I still feel a message queue is your best option.

EDIT - Non-MQ route

Since I gave you a tough time about asking this question, I will continue to try to help. Besides the message queue, you can do some sort of XML file like you were doing before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). If you go this route you will want to SELECT * WHERE CREATE_TIME < ?, where ? is the time the batch started. Basically you're only getting the rows as of a snapshot.

Now, on your other database, for the deletes you're going to remove rows by joining against an ID table (that is, you can use JOINs instead of a slow NOT IN). Luckily you only need the IDs for the deletes, not the other columns. For the other columns you can use a delta based on the update timestamp column (for updates, and creates aka inserts).
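
Here is a rough sketch of both halves (names like `modified_at` and `sync_ids` are illustrative; the delete uses MySQL's multi-table DELETE syntax, since your slave is MySQL):

```java
import java.sql.*;

public class SnapshotSync {
    // Pull only rows changed in the (lastSync, batchStart] window: a stable snapshot.
    public static void pullChanges(Connection master, Timestamp lastSync,
                                   Timestamp batchStart) throws SQLException {
        String sql = "SELECT * FROM master_table "
                   + "WHERE modified_at > ? AND modified_at <= ?";
        try (PreparedStatement ps = master.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSync);
            ps.setTimestamp(2, batchStart);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... write each row to the XML export ...
                }
            }
        }
    }

    // sync_ids holds the full set of master ids from the latest export.
    // LEFT JOIN ... IS NULL is the JOIN-based replacement for the slow NOT IN.
    public static int applyDeletes(Connection slave) throws SQLException {
        String sql = "DELETE s FROM slave_table s "
                   + "LEFT JOIN sync_ids i ON s.id = i.id "
                   + "WHERE i.id IS NULL";
        try (Statement st = slave.createStatement()) {
            return st.executeUpdate(sql);
        }
    }
}
```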

Adam Gent
  • Can you explain this a little further for me to search on the subject? A simple workflow or the contents of a "message" would help me understand your proposal. Thx – Serkan Arıkuşu Mar 13 '13 at 08:06
  • Try looking at Spring Integration, play around with it, and then ask some questions. I think once you play with it you will get the idea. – Adam Gent Mar 13 '13 at 13:31
  • But what if others delete rows from the table? – flup Mar 13 '13 at 23:50
  • If he deletes rows from the master he is fine. He just sends the "DELETE" as a message and it will eventually make its way to the other database. As I said in the answer he can even send the raw SQL. Order is preserved in message queues. This idea is exactly how other databases do replication but with custom queues. That being said if he deletes rows from the slave w/o deleting from the master he is in trouble. – Adam Gent Mar 14 '13 at 00:04
  • If he does not want to use triggers and does not want to update the apps accessing the database, who would then generate those events/messages? And how would they be tied to the individual transactions so that no inconsistencies arise? – user1050755 Mar 17 '13 at 21:56
  • It was unclear what he could and couldn't do. Regardless, an event-driven message queue is a good thing to have for future requirements. – Adam Gent Mar 17 '13 at 22:06

I am not sure about the solution, but I hope these links help you.

http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm

http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases

Shailesh Saxena

Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column, i.e. mark the row as deleted instead of actually deleting it immediately. Delete it only after having exported the delete action.
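
A rough sketch of that soft-delete idea, assuming illustrative `deleted` and `modified_at` columns:

```java
import java.sql.*;

public class SoftDelete {
    // Instead of DELETE, flag the row; the sync job picks it up via the timestamp.
    public static void markDeleted(Connection master, long id) throws SQLException {
        String sql = "UPDATE master_table "
                   + "SET deleted = 1, modified_at = CURRENT_TIMESTAMP WHERE id = ?";
        try (PreparedStatement ps = master.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }

    // Once the delete action has been exported, physically remove the rows.
    public static int purgeExported(Connection master, Timestamp exportedUpTo)
            throws SQLException {
        String sql = "DELETE FROM master_table WHERE deleted = 1 AND modified_at <= ?";
        try (PreparedStatement ps = master.prepareStatement(sql)) {
            ps.setTimestamp(1, exportedUpTo);
            return ps.executeUpdate();
        }
    }
}
```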

In case you cannot alter the schema used by an existing app:

Can't you use triggers at all? How about a second ("hidden") table that gets populated on every insert/update/delete and whose contents constitute the next XML export file to be generated? That is a common concept: a history (or "log") table. It would have its own increasing ID column, which can be used as an export marker.
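
A sketch of that idea against an MSSQL master (the `master_log` table, `log_id` marker, and trigger names are illustrative):

```java
import java.sql.*;

public class HistoryTable {
    // One-time setup: a delete trigger that records the ids of removed rows.
    // 'deleted' is MSSQL's pseudo-table of rows affected by the DELETE.
    public static void createDeleteTrigger(Connection master) throws SQLException {
        String ddl = "CREATE TRIGGER trg_master_delete ON master_table "
                   + "AFTER DELETE AS "
                   + "INSERT INTO master_log (row_id, action) "
                   + "SELECT id, 'D' FROM deleted";
        try (Statement st = master.createStatement()) {
            st.execute(ddl);
        }
    }

    // The log table's own increasing log_id column is the export marker.
    public static void exportSince(Connection master, long lastMarker)
            throws SQLException {
        String sql = "SELECT log_id, row_id, action FROM master_log "
                   + "WHERE log_id > ? ORDER BY log_id";
        try (PreparedStatement ps = master.prepareStatement(sql)) {
            ps.setLong(1, lastMarker);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... append to the XML export; remember the max log_id ...
                }
            }
        }
    }
}
```

Insert and update triggers would follow the same pattern using the `inserted` pseudo-table.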

user1050755
  • Thank you for your comment. I already have a timestamp column and am using it in selects. Unfortunately I cannot have a deleted column, because this would mean modifying the applications that actually use the Master table. I want this sync job to be independent of the domain. – Serkan Arıkuşu Mar 06 '13 at 07:25

Have a look at Oracle GoldenGate:

Oracle GoldenGate is a comprehensive software package for enabling the replication of data in heterogeneous data environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems.

SymmetricDS:

SymmetricDS is open source software for multi-master database replication, filtered synchronization, or transformation across the network in a heterogeneous environment. It supports multiple subscribers with one direction or bi-directional asynchronous data replication.

Daffodil Replicator:

Daffodil Replicator is a Java tool for data synchronization, data migration, and data backup between various database servers.

Mikhail

Very interesting question.

In my case I had enough RAM to load all the IDs from the master and slave tables and diff them.

If the IDs in the master table are sequential, you may try to maintain a set of completely filled ranges in the master table (ranges in which every ID is used, with no gaps, like 100, 101, 102, 103).

To find removed IDs without loading all of them into memory, you may execute a SQL query that counts the records with id >= full_region.start and id <= full_region.end for each completely filled region. If the result of the query equals (full_region.end - full_region.start) + 1, it means no record in the region has been deleted. Otherwise, split the region into two parts and do the same check for both of them (in many cases only one half contains removed records).

Below some range length (about 5000, I think) it becomes faster to load all the present IDs and check for absent ones using a Set.

It also makes sense to load all IDs into memory for a batch of small (10-20 record) regions.
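
A sketch of the recursive narrowing described above (the `countIds` callback is assumed to run `SELECT COUNT(*) FROM master_table WHERE id BETWEEN ? AND ?`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class RangeDiff {
    private static final long SMALL_RANGE = 5_000; // threshold mentioned above

    /** Returns the sub-ranges that contain at least one deleted id. */
    public static List<long[]> findIncompleteRanges(
            long start, long end, BiFunction<Long, Long, Long> countIds) {
        List<long[]> incomplete = new ArrayList<>();
        long expected = end - start + 1;
        if (countIds.apply(start, end) == expected) {
            return incomplete;                      // no deletions in this range
        }
        if (expected <= SMALL_RANGE) {
            incomplete.add(new long[]{start, end}); // small enough: diff ids in memory
            return incomplete;
        }
        long mid = start + (end - start) / 2;       // split and recurse on both halves
        incomplete.addAll(findIncompleteRanges(start, mid, countIds));
        incomplete.addAll(findIncompleteRanges(mid + 1, end, countIds));
        return incomplete;
    }
}
```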

Michail Nikolaev
  • Thank you for your eye-opening solution, especially the use of ranges. Unfortunately the master and slave databases are unconnected, so I cannot load all the IDs from the slave table. But I will think about in-memory solutions. – Serkan Arıkuşu Mar 06 '13 at 07:32

Make a history table for the table that needs to be synchronized (basically a duplicate of that table, perhaps with a few extra fields) and insert the entire row into it every time something is inserted/updated/deleted in the active table.

Write a Spring Batch job to sync the data to the Slave machine based on the history table's extra fields.

Hope this helps.

basav

A potential option for allowing deletes within your current workflow:

In the case that the trigger restriction is limited to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database that stores only the unique identifiers of the deleted rows (or whatever unique key would let you delete the removed rows most efficiently).

Those IDs would need to be inserted by an on-delete trigger on your master table.

Using the same mechanism as your inserts/updates, create a task that runs after your inserts and updates. You could export your helper table to XML, as you noted in your current workflow.

This task would simply delete the matching rows from the slave table, then delete all data from the helper table once it completes. Log any errors from the task so that you can troubleshoot it, since there is no audit trail.
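
A sketch of that final task, assuming the helper table's IDs have already been parsed out of the XML export (all names are illustrative):

```java
import java.sql.*;
import java.util.List;

public class ApplyDeletes {
    public static void apply(Connection slave, Connection master,
                             List<Long> deletedIds) throws SQLException {
        // Delete the exported ids from the slave in one JDBC batch.
        try (PreparedStatement ps =
                 slave.prepareStatement("DELETE FROM slave_table WHERE id = ?")) {
            for (long id : deletedIds) {
                ps.setLong(1, id);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        // Clear the helper table only after the slave-side deletes succeed.
        try (Statement st = master.createStatement()) {
            st.executeUpdate("DELETE FROM deleted_ids_helper");
        }
    }
}
```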

badpanda

If your database has a transaction log that can be dumped, just ship that.

It is possible with MySQL and should be possible with PostgreSQL.

user1050755

I would agree with another comment - this requires the use of triggers. I think another table should hold the history of your SQL statements. See this answer about using SQL Server 2008 Extended Events... Then you can capture the entire SQL statement and store the resulting query in the history table. It's up to you whether you want to store it as a MySQL query or an MSSQL query.

Aaron Saray

Here's my take: do you really need to deal with this? I assume the slave is for reporting purposes, so the question I would ask is how up to date it needs to be. Is it OK if the data is one day old? Do you plan a nightly refresh?

If so, forget about this online sync process: download the full tables, ship them to the MySQL server, and batch-load them. The processing time might be a lot quicker than you think.
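
A sketch of the reload on the MySQL side (assumes a tab-separated dump file shipped from the master and a JDBC connection configured with allowLoadLocalInfile=true; names and paths are illustrative):

```java
import java.sql.*;

public class FullRefresh {
    public static void reload(Connection slave, String dumpPath) throws SQLException {
        try (Statement st = slave.createStatement()) {
            st.executeUpdate("TRUNCATE TABLE slave_table");
            // LOAD DATA is MySQL's fast bulk-load path for flat files.
            st.executeUpdate("LOAD DATA LOCAL INFILE '" + dumpPath + "' "
                           + "INTO TABLE slave_table "
                           + "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
        }
    }
}
```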

Robert Co