
I have a table in SQL Server with 200 million records that exists on two different servers. I need to move this table from Server 1 to Server 2.

The table on Server 1 can be a subset or a superset of the table on Server 2. Some of the records (around 1 million) on Server 1 have been updated, and those updates need to be applied on Server 2. Currently I am following this approach:

1) Use SSIS to move the data from Server 1 to a staging database on Server 2.

2) Compare the data in staging with the table on Server 2, column by column. If any column differs, I update the whole row.

This is taking a lot of time. I tried using HASHBYTES in order to compare rows, like this:

    HASHBYTES('sha', CONCAT(a.[account_no], a.[transaction_id], ...))
    <>
    HASHBYTES('sha', CONCAT(b.[account_no], b.[transaction_id], ...))

But this is taking even more time.
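For context, the hash comparison is used inside the update roughly like this (a simplified sketch; the table names, join keys, and the extra columns such as [amount] and [status] are placeholders for the real ones):

    UPDATE b
    SET    b.[amount] = a.[amount],
           b.[status] = a.[status]            -- every column gets copied when any differ
    FROM   staging.[transactions] AS a        -- staged copy from Server 1
    JOIN   dbo.[transactions]     AS b        -- live table on Server 2
           ON  b.[account_no]     = a.[account_no]
           AND b.[transaction_id] = a.[transaction_id]
    WHERE  HASHBYTES('sha', CONCAT(a.[account_no], a.[transaction_id], a.[amount], a.[status]))
        <> HASHBYTES('sha', CONCAT(b.[account_no], b.[transaction_id], b.[amount], b.[status]));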

Is there any other approach that would be faster and save time?

  • Can you provide table schemas and execution plans? Is the table on server 2 in use during this operation? – David Rushton Aug 09 '17 at 10:10
  • `BINARY_CHECKSUM(*)` might be slightly faster to initially determine if rows are definitely not identical, but it's unlikely the comparisons are the bottleneck in the first place, so I wouldn't focus on that. What does the whole query you use for the update look like, and what indexes are on the tables? – Jeroen Mostert Aug 09 '17 at 10:26
  • Some options here. https://stackoverflow.com/questions/3711217/fastest-way-to-update-120-million-records – SQLAndOtherStuffGuy Aug 09 '17 at 10:30
  • If you are using SSIS, you might want to try and create a Script Component / Script Task (C#) in order to load your record sets into DataTables - C# is pretty fast if it comes to in-memory comparisons. – Tyron78 Aug 09 '17 at 13:43

1 Answer


This is a problem that's pretty common.

First, do not try to do the updates directly in SQL: the performance will be terrible, and it will bring the database server to its knees.

For context, TS1 will be the table on Server 1 and TS2 will be the table on Server 2. Using SSIS, create two steps within the job (see the sketch below):

1) Find the deleted records: scan TS2 by ID, and delete any TS2 row whose ID does not exist in TS1.

2) Scan TS1. If the ID exists in TS2, update that record; if memory serves, SSIS can inspect for differences and only update when needed, otherwise just execute the update statement. If the ID does not exist in TS2, insert the record.
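For illustration only, the logic those two steps implement looks roughly like the T-SQL below, assuming the Server 1 data has already been staged on Server 2 (schema, table, and column names are placeholders, and id stands in for the real key); in practice the idea is to let SSIS drive this rather than running raw statements over 200 million rows in one go:

    -- Step 1: delete TS2 rows whose ID no longer exists in TS1 (the staged copy)
    DELETE t2
    FROM   dbo.TS2 AS t2
    WHERE  NOT EXISTS (SELECT 1 FROM staging.TS1 AS t1 WHERE t1.id = t2.id);

    -- Step 2a: update rows whose ID exists in both tables
    UPDATE t2
    SET    t2.col1 = t1.col1,
           t2.col2 = t1.col2
    FROM   dbo.TS2 AS t2
    JOIN   staging.TS1 AS t1 ON t1.id = t2.id;

    -- Step 2b: insert rows that exist only in TS1
    INSERT INTO dbo.TS2 (id, col1, col2)
    SELECT t1.id, t1.col1, t1.col2
    FROM   staging.TS1 AS t1
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.TS2 AS t2 WHERE t2.id = t1.id);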

I can't speak to exact performance, due to variations in schemas and servers, but analyzing the 200 million records will be compute intensive. It WILL take a long time.

For ongoing execution, you will need to add a "last modified date" timestamp to each record, plus a trigger to update the field on any legitimate change, and then use that column to filter down your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last-modified-date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index on that column to aid the filtering.
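A minimal sketch of that change-tracking setup on the source table (TS1 here), assuming a single-column id key and placeholder object names:

    -- Add the tracking column (existing rows get the default) and an index for filtering
    ALTER TABLE dbo.TS1 ADD last_modified_date DATETIME2 NOT NULL
        CONSTRAINT DF_TS1_last_modified DEFAULT SYSUTCDATETIME();

    CREATE INDEX IX_TS1_last_modified ON dbo.TS1 (last_modified_date);
    GO

    -- Keep the column current on any legitimate change
    CREATE TRIGGER trg_TS1_last_modified ON dbo.TS1
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE t
        SET    t.last_modified_date = SYSUTCDATETIME()
        FROM   dbo.TS1 AS t
        JOIN   inserted AS i ON i.id = t.id;
    END;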

The other option is to perform a burn and load each time: disable any constraints around TS2, truncate TS2, copy the data into TS2 from TS1, and finally re-enable the constraints and rebuild any indexes.
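A rough outline of the burn and load in T-SQL (the index and table names are placeholders, and the copy itself would still be the SSIS data flow or a bulk load rather than the comment shown):

    -- Disable check/foreign key constraints and any nonclustered indexes on the target
    ALTER TABLE dbo.TS2 NOCHECK CONSTRAINT ALL;
    ALTER INDEX IX_TS2_somecolumn ON dbo.TS2 DISABLE;   -- repeat per nonclustered index

    -- NOTE: TRUNCATE fails if other tables reference TS2 via foreign keys,
    -- even disabled ones; those would have to be dropped first, or use DELETE instead.
    TRUNCATE TABLE dbo.TS2;

    -- ... bulk load TS1 into TS2 here (SSIS data flow / BULK INSERT) ...

    -- Re-validate constraints and rebuild the indexes
    ALTER TABLE dbo.TS2 WITH CHECK CHECK CONSTRAINT ALL;
    ALTER INDEX ALL ON dbo.TS2 REBUILD;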

Best of luck to you.

GKlesczewski