
Consider two very large tables: Table A with 20 million rows, and Table B with 10 million rows that largely overlap Table A. Both have an identifier column and a number of other data columns. I need to move all rows from Table B into Table A, updating the rows that already exist there.

Both tables have the same structure:
 - Identifier (int)
 - Date (DateTime)
 - Identifier A
 - Identifier B
 - General decimal data (maybe 10 columns)
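
For reference, a rough DDL sketch of that structure (column names and decimal precision here are illustrative, not the real schema):

CREATE TABLE dbo.TableA
(
    Identifier  int      NOT NULL,   -- the column used to match rows between the tables
    [Date]      datetime NOT NULL,
    IdentifierA int      NOT NULL,
    IdentifierB int      NOT NULL,
    Value01     decimal(18, 4) NULL,
    Value02     decimal(18, 4) NULL
    -- ... roughly 10 decimal columns in total
);
-- dbo.TableB has the same structure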

I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?

I've tried pulling out the existing records from Table B and running a large update against Table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index in place.
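
(For reference, a set-based version of that update would look roughly like this, with the decimal columns abbreviated; it is still one very large logged operation over up to 10 million rows.)

UPDATE a
SET    a.[Date]  = b.[Date],
       a.Value01 = b.Value01          -- ...plus the other decimal columns
FROM   dbo.TableA AS a
       INNER JOIN dbo.TableB AS b
           ON b.Identifier = a.Identifier;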

I've also tried doing a one-shot delete from Table A of the rows that exist in Table B, and performance of the delete is also poor, even with the indexes dropped.
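
(The delete/insert variant was roughly along these lines, again with the columns abbreviated.)

-- remove the overlapping rows from TableA...
DELETE a
FROM   dbo.TableA AS a
WHERE  EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.Identifier = a.Identifier);

-- ...then insert everything from TableB
INSERT INTO dbo.TableA (Identifier, [Date], IdentifierA, IdentifierB /*, ... */)
SELECT Identifier, [Date], IdentifierA, IdentifierB /*, ... */
FROM   dbo.TableB;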

I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.

gmn
  • Pretty sure (correct me if I'm wrong) but Brent advises against using merge - "Uh oh - bad news about the MERGE statement. It's known to be buggy, and Microsoft hasn't fixed the bugs: https://www.mssqltips.com/s..." via https://gavindraper.com/2017/05/08/usgin-sql-server-merge/ – MrEdmundo Apr 19 '18 at 08:46
  • Speed depends on *indexes*. Whether you use UPDATE/INSERT or MERGE won't affect speed as much as having the correct indexes. Even better, *don't* merge. Use change tracking to find and upsert *only* the rows that changed since the last upsert operation – Panagiotis Kanavos Apr 19 '18 at 08:48
  • The OP should make it clearer but I happen to know that everything in Table B needs to make it in to Table A i.e. the "change tracking" has already occurred. – MrEdmundo Apr 19 '18 at 08:50
  • What is the logic? Do you have a target table which gets regularly updated by a source table? Is it a one time job? – Bonzay Apr 19 '18 at 08:50
  • What are the table schemas? What kind of data is stored in those tables? Is one of them a staging table perhaps? A source and reporting table? The specifics matter a *lot* in ETL (which is what you want to do here). Can you use partition switching perhaps? Change tracking? Columnstores? What does the execution plan look like? Is the join operation slow or the index update? BTW if you drop an index, DELETE will go *slower* because it has to scan the entire table to find what to delete. – Panagiotis Kanavos Apr 19 '18 at 08:55
  • I've added some clarification to the question, I'm open to further suggestions – gmn Apr 19 '18 at 09:02
  • 1
    still there is not enough info about the problem you are trying to solve. It is not just a coding issue, there is much to consider for a proper design. For example you could create a generator table which will include a binary flag to find the IDs that are to be updated and those to be inserted without going through the whole target table.But it all depends on the use case you are trying to solve. I suggest you look a little bit more into ETLs like @PanagiotisKanavos said. good luck – Bonzay Apr 19 '18 at 09:21
  • 1
    @PanagiotisKanavos, "DELETE will go slower because it has to scan the entire table to find what to delete". - This is nonsense. SQL Server can use indexes to find rows for DELETE the same way as it would find rows for SELECT operation. Therefore right indexes can speedup deletes on large tables very much. – Alexander Volok Apr 19 '18 at 09:25
  • @AlexanderVolok so you agree with me. Why is *your* comment nonsense then? Perhaps you misread *and* misquoted my comment? `if you drop an index, DELETE will go slower because it has to scan the entire table` – Panagiotis Kanavos Apr 19 '18 at 10:45
  • 2
    @gmn you need to provide *more* information about the tables and the data they contain. For example, if the data can be partitioned eg by date or product, and it makes sense to replace an entire target partition with the data from a source partition you could create a staging table with the same partitioning scheme (eg by year or semester), fill it with the new data and then swap partitions between the staging and target tables. That's an online operation, instant operation. Partitioning is available in all SQL Server versions since 2016 SP1, even Express – Panagiotis Kanavos Apr 19 '18 at 11:05
  • 1
    @gmn you could use partitioning even if the source data doesn't completely replace the target data. You could insert overlapping data from the source and *non*overlapping data from the target into the staging table for example to create full partitions. – Panagiotis Kanavos Apr 19 '18 at 11:16
  • @PanagiotisKanavos partitioning is an interesting angle and exactly the sort of suggestion I was looking for thanks. I will have to experiment – gmn Apr 19 '18 at 11:20
  • Delete? How does delete fit into update and insert? Please post your slow TSQL. – paparazzo Apr 19 '18 at 11:22
  • @paparazzo it does fit and is a common technique for *staging/intermediate* tables. There's no such thing as bulk update or minimally logged update. Instead of updating a *lot* of rows you can delete the target rows and insert their replacements. If the query is written properly the server will log entire data pages instead of individual transactions. You can only do that on staging tables though, otherwise the users will notice the missing data – Panagiotis Kanavos Apr 19 '18 at 11:30
  • @PanagiotisKanavos OK I would still like to hear from the OP. We don't even know how these tables are joined (common key). It makes no sense delete is slow. OP specifically asked updating of existing. – paparazzo Apr 19 '18 at 11:32
  • @paparazzo for the same reason, using SSIS can be *faster* than an INSERT .. SELECT. The `INSERT .. SELECT` will take locks on both tables and possibly spill into tempdb while trying to move the data. Complex SELECT queries also mean you can't use minimally logged inserts. SSIS on the other hand will read data from the source with a firehose cursor *as a stream of rows* and insert it to the target with a bulk insert operation. – Panagiotis Kanavos Apr 19 '18 at 11:34
  • @PanagiotisKanavos I don't have a question here. I am asking the OP for information. – paparazzo Apr 19 '18 at 11:36
  • @paparazzo on the contrary it does. If you have even a 50% overlap that's 5 million rows or 25% of the target table. That's probably enough to escalate exclusive locks to the entire table. The operation will lock the affected rows for the entire operation too. – Panagiotis Kanavos Apr 19 '18 at 11:36
  • @PanagiotisKanavos I do not have a question for you. Again I am asking the OP for information. I have not proposed any solution. – paparazzo Apr 19 '18 at 11:39
  • @paparazzo I'm open to possible alternative joins, although right now I use the identifier column. I appreciate this is quite a generalised question, but I don't want people to concentrate on the micro-level stuff here; deleting 10 million rows (assuming 100% overlap) out of a 20 million row table is going to be a slow operation, it's a matter of making it work in the minimal time. I'm happy to take suggestions for other ways the data might be joined/partitioned. – gmn Apr 19 '18 at 11:42
  • The common column and value comes from the data. What is the primary key. If you would post the TSQL and query plan maybe we could help you. We don't know the other ways as you have not given specifics on what you are doing now. Why are you not posting the existing slow TSQL? – paparazzo Apr 19 '18 at 11:47
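
To illustrate the staging-table / partition-switching idea suggested in the comments above, a minimal sketch (all object names are hypothetical; the staging table must exactly match TableA's columns, indexes, constraints and filegroup, and the target must be empty when switching in):

-- 1. Build the merged result in a staging table that exactly matches dbo.TableA
INSERT INTO dbo.TableA_Staging WITH (TABLOCK)   -- TABLOCK can allow minimal logging under simple/bulk-logged recovery
SELECT b.* FROM dbo.TableB AS b                 -- TableB's version wins for overlapping identifiers
UNION ALL
SELECT a.* FROM dbo.TableA AS a
WHERE  NOT EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.Identifier = a.Identifier);

-- 2. Swap the staging table in; SWITCH is a metadata-only operation
BEGIN TRAN;
    TRUNCATE TABLE dbo.TableA;                           -- target must be empty to switch into
    ALTER TABLE dbo.TableA_Staging SWITCH TO dbo.TableA; -- near-instant metadata change
COMMIT;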

2 Answers


Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using a bulk-logged technique to load the desired content into a new table and then perform a table swap:

Example using SELECT INTO:

SELECT *
INTO NewTableA
FROM (
    SELECT * FROM dbo.TableB b   -- TableB's rows win where the identifier already exists in TableA
    UNION ALL
    SELECT * FROM dbo.TableA a WHERE NOT EXISTS (SELECT * FROM dbo.TableB b WHERE b.Identifier = a.Identifier)
) d

exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'

Simple or at least Bulk-Logged recovery is highly recommended for this approach. Also, I assume it has to be done outside business hours, since plenty of missing objects have to be recreated on the new table: indexes, default constraints, the primary key, etc.
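
For example, something along these lines would need to be re-run against the new table before the rename (the key column, index and constraint names here are illustrative, and must not clash with the names still attached to the old table):

ALTER TABLE dbo.NewTableA
    ADD CONSTRAINT PK_NewTableA PRIMARY KEY CLUSTERED (Identifier);

CREATE NONCLUSTERED INDEX IX_NewTableA_Date
    ON dbo.NewTableA ([Date]);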

Alexander Volok
  • This is interesting, you might be onto something here, as it's the update and delete which are the costly operations, so creating a new table based on the two existing ones will probably be quicker – gmn Apr 19 '18 at 09:05
  • Thanks, just don't miss some secondary objects, like indexes, constraints, since they will not be copied by SELECT INTO. – Alexander Volok Apr 19 '18 at 09:30
  • Instead of renaming tables and recreating indexes you could create an identical staging table, fill it and use partition swapping to replace one data set with the other immediately - partition swapping is a metadata operation so it's essentially online. Partition swapping between tables doesn't even require the use of partitions – Panagiotis Kanavos Apr 19 '18 at 10:52
  • Besides, the OP is trying to update the *big* table with data from the *small* table, not replace it. Swapping or renaming would lose the non-overlapping data – Panagiotis Kanavos Apr 19 '18 at 10:59
  • 1
    @PanagiotisKanavos, I think approach I've offered is already can be complicated and replacement of trivial table renames with a partition switching will not make it simpler. I believe you can just create your own answer and show a "right way". – Alexander Volok Apr 19 '18 at 11:17

A MERGE is probably your best bet, if you want to do both inserts and updates.

MERGE dbo.TableA AS Tgt
    USING (SELECT * FROM dbo.TableB) AS Src
    ON (Tgt.Identifier = Src.Identifier)
    WHEN MATCHED THEN
        UPDATE SET Date = Src.Date, ...
    WHEN NOT MATCHED THEN
        INSERT (Identifier, Date, ...)
        VALUES (Src.Identifier, Src.Date, ...);

Note that the merge statement must be terminated with a ;

Peter