
I have two tables, SourceTable and DestTable. SourceTable contains the columns Col, Source and RowNum. I want to join the two tables on Col and then update the top 'n' rows of DestTable for each Col, where 'n' is the value of RowNum in SourceTable. I'm able to do this with CROSS APPLY, but it takes hours, so I'm looking for another way of writing this query.

My CROSS APPLY solution is -

UPDATE t
SET    dest = source
FROM   #sourcetable s
       -- for each source row, update the top s.rownum matching rows in #desttable
       CROSS APPLY (SELECT TOP(s.rownum) *
                    FROM   #desttable d
                    WHERE  d.col = s.col) t

Below is sample data -

SourceTable -

Col, Source, RowNum
11,  111,    2
12,  222,    1

DestTable (before update) -

Id, Col, Dest
1,  11,  0
2,  11,  0
3,  11,  0
4,  12,  0
5,  12,  0

DestTable (after update) -

Id, Col, Dest
1,  11,  111
2,  11,  111
3,  11,  0
4,  12,  222
5,  12,  0
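
For reference, here is a minimal script that reproduces the sample data above (the column types are simplified guesses - the real tables are much wider, see the update below):

CREATE TABLE #sourcetable (col INT, source INT, rownum INT);
CREATE TABLE #desttable (id INT, col INT, dest INT);

INSERT INTO #sourcetable (col, source, rownum)
VALUES (11, 111, 2),
       (12, 222, 1);

INSERT INTO #desttable (id, col, dest)
VALUES (1, 11, 0),
       (2, 11, 0),
       (3, 11, 0),
       (4, 12, 0),
       (5, 12, 0);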

But this performs extremely poorly (even on a properly indexed table, it takes hours when the table has a few million rows and close to 400 columns).

Is there a more efficient way to write this query?

Update - the actual table structure is as follows:

DestTable (contains about 1 million records and every column is populated) - 38 datetime, 66 float, 31 int and 210 varchar(50) fields

SourceTable (a temp table created before the SQL in question runs; contains about 100k records and every column is populated) - 16 fields total (4 int, 2 datetime, 1 float, the rest varchar(50))

Indexes - I can't create a covering index that includes all of the required fields because they vary from query to query. I've tried creating indexes on various combinations of fields and found that the update runs significantly faster without any indexes.
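
For illustration, the kind of index I tried looked roughly like this (a placeholder only - the actual field combinations varied from test to test):

-- hypothetical example of one of the index shapes tested; the real column lists varied
CREATE NONCLUSTERED INDEX IX_desttable_col_id
ON #desttable (col, id);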

Current test results (row counts refer to DestTable):

1,000 rows (without any index) - 05:00 minutes

10,000 rows (without any index) - 17:38 minutes

100,000 rows (without any index) - 3 hours 7 minutes

100,000 rows (with an index on 4 fields) - over 7 hours, at which point I stopped it

Achilles
  • I can understand that millions of rows make this slower, but you need to explain how multiple columns affect this. Also, performance questions should include `EXPLAIN ANALYZE` and some information about table size, indexes, current time performance, desired time, etc. `Slow` is a relative term and we need a real value to compare against. – Juan Carlos Oropeza Jul 13 '16 at 18:18
  • Thank you, I added the information regarding the table structure and current performance. Desired time is less than 20 minutes for 1 million rows. I'm not sure what Explain Analyze is. Is it a tag or something that should be added to the question? – Achilles Jul 14 '16 at 13:06
  • It's almost always a wise thing to split large updates into smaller ones, because... well, large transactions tend to be slow. Did you try doing it with a cursor, one col at a time? – dean Jul 14 '16 at 13:55
  • This is [**EXPLAIN PLAN**](http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan)... btw, if 1,000 records take 5 min, it is very reasonable that 100k take 3h. It looks like the time increase is linear in the number of rows, so if you improve your 1,000-row time the others will also improve. – Juan Carlos Oropeza Jul 14 '16 at 14:04
  • Got it. I'm updating only one column. Let me ask you something else - is it a good idea to spawn multiple threads to update different rows simultaneously (one thread for the first thousand rows, another for the next thousand, and so on)? I don't know how the database works internally, so I'm not sure if this will help or make things worse. – Achilles Jul 14 '16 at 15:53
  • @Achilles It depends on whether the updates cause table locks or not. – Juan Carlos Oropeza Jul 14 '16 at 16:00
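
For reference, here is a rough sketch of the chunked-update idea from dean's comment above. It assumes Col is an integer that can be ranged over and uses an arbitrary chunk size; adjust both to the real schema:

-- sketch only: process the source rows in ranges of col values instead of one giant update
-- (assumes col is an integer; chunk size of 1000 is arbitrary)
DECLARE @mincol INT, @maxcol INT, @chunk INT = 1000;
SELECT @mincol = MIN(col), @maxcol = MAX(col) FROM #sourcetable;

WHILE @mincol <= @maxcol
BEGIN
    UPDATE t
    SET    dest = source
    FROM   #sourcetable s
           CROSS APPLY (SELECT TOP(s.rownum) *
                        FROM   #desttable d
                        WHERE  d.col = s.col) t
    WHERE  s.col BETWEEN @mincol AND @mincol + @chunk - 1;

    SET @mincol += @chunk;
END;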

1 Answer


You could try using a simple JOIN instead of a correlated subquery:

WITH cte AS
(
 SELECT d.id, d.col, d.dest, s.source
 FROM (SELECT *,
              -- number DestTable rows within each col group
              rn = ROW_NUMBER() OVER (PARTITION BY col ORDER BY id)
       FROM #desttable) d
 JOIN #sourcetable s
   ON d.col = s.col
  AND d.rn <= s.rownum   -- keep only the first s.rownum rows per col
)
UPDATE cte
SET dest = source;

SELECT *
FROM #desttable;

LiveDemo


You should post your real data sample, data structures and query plans. Otherwise we could only guess how to improve it.
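
For example, a sketch of one way to capture the numbers for the UPDATE (the actual plan itself can be grabbed via "Include Actual Execution Plan" in SSMS):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the UPDATE from the CTE above here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;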

Lukasz Szozda
  • @JuanCarlosOropeza :) I've used http://www.convertcsv.com/csv-to-sql.htm to get tables – Lukasz Szozda Jul 13 '16 at 18:37
  • Doesn't that have an option to choose other RDBMS as output? I use the `text to ddl` function from sqlFiddle – Juan Carlos Oropeza Jul 13 '16 at 19:06
  • @lad2025 Thank you, your solution works faster than cross apply. The execution time went down from over 7 hours to 4-6 hours. I'm still trying to optimize it though. Updating just a million records shouldn't take so long. – Achilles Jul 14 '16 at 13:10
  • @Achilles Instead of `UPDATE`, try `CREATE TABLE dummyDest as SELECT ...`. If it is faster, you can `DROP TABLE` and `ALTER TABLE RENAME`. – Juan Carlos Oropeza Jul 14 '16 at 15:53