
I have a stored procedure that ETLs data from all over the place and, at the end, attempts to join it all together into one single large table. We're talking about 100 columns wide and 35 million rows.

When I pull from the temp tables and join it all together, the insert query can take hours, page to disk, etc. It's simply too large for my environment.

Insert Into tbl_huge
    Select Distinct a, b, c, d, e, f, g, h, i, j, k
    from (Mess of Subqueries & #tmp_tbls) 

What's the best way to batch this insert so it commits, say, 100k rows at a time? There's no good natural key in the data that breaks it up even semi-evenly, and I'm worried that will remain true going forward anyway.

I've seen various examples that use WHERE NOT EXISTS against the destination table, but that seems like the wrong approach and one that won't continue to scale as the table grows.
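For reference, the pattern I've seen looks roughly like this; `#staged_source` is just a stand-in for my mess of subqueries and temp tables, and the comparison columns are placeholders since there is no real key:

-- Repeatedly insert the "next" batch of rows that aren't already in the destination.
-- #staged_source is a stand-in for the real source; a..k are the same placeholder columns as above.
INSERT INTO tbl_huge (a, b, c, d, e, f, g, h, i, j, k)
SELECT DISTINCT TOP (100000) s.a, s.b, s.c, s.d, s.e, s.f, s.g, s.h, s.i, s.j, s.k
FROM #staged_source AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM tbl_huge AS t
    WHERE t.a = s.a AND t.b = s.b AND t.c = s.c  -- would really need every column to be safe
);

Every pass has to re-scan the ever-growing destination table, which is why I don't think it scales.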

So what's the best approach here? Sort the result and re-run the select & insert query multiple times in a while loop, keeping a counter so I know I need to take rows > x?

Is there a better approach, either one that lets me accurately select a subset to insert, or one that pre-processes the select into memory (plus page file, given the size) and reads it back to insert in chunks?

marc_s
DiscontentDisciple
  • One of the first things I'd look at is if you can reorganize your `(Mess of Subqueries & #tmp_tbls)` in a way that could eliminate the need for the `SELECT DISTINCT` (see the sketch after these comments). If anything is causing you to eat up resources, that would be it. – daShier Aug 28 '19 at 03:16
  • My strong advice is [don’t use a stored procedure](https://stackoverflow.com/questions/6368985/mysql-stored-procedures-use-them-or-not-to-use-them/6369030#6369030), but rather do the work of assembling the data changes outside the DB using the app language of your choice, then send in 10K rows at a time – Bohemian Aug 28 '19 at 03:35
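As a rough illustration of daShier's suggestion, duplicates can often be removed per source with `ROW_NUMBER()` before the join, so the outer `SELECT DISTINCT` over the whole joined result is no longer needed. The table name `#tmp_source` and the partition columns below are placeholders, not from the question:

-- De-duplicate one source on its own grain before joining, instead of
-- applying SELECT DISTINCT to the entire joined result.
;WITH dedup_source AS (
    SELECT a, b, c,
           ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY c DESC) AS rn
    FROM #tmp_source
)
SELECT a, b, c
FROM dedup_source
WHERE rn = 1;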

2 Answers

1

ETL is "Extract, Transform and Load". You're doing all of this in one step, when it needs to be three steps.

You really need to "LAND" (Extract) the data before Transforming it, and only then load it into a single humongous table.

Create fixed tables (not #temp) for your (Mess of Subqueries & #tmp_tbls) to extract your data into.
Load the raw data into these tables (if all your data is on the one server or in the same database, you could skip this step).

Then Transform the data into the "shape" it needs for your insert.

When that is done, load it.
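A minimal sketch of that shape, assuming made-up staging table names (stg_source1, stg_source2, stg_combined) and the placeholder columns from the question:

-- 1. Extract: land the raw data in permanent staging tables (not #temp)
INSERT INTO stg_source1 (a, b, c)
SELECT a, b, c FROM SourceSystem1.dbo.SomeTable;  -- one of these per source

-- 2. Transform: shape and de-duplicate into a table that already matches the destination
INSERT INTO stg_combined (a, b, c)
SELECT DISTINCT s1.a, s1.b, s2.c
FROM stg_source1 AS s1
JOIN stg_source2 AS s2 ON s2.a = s1.a;

-- 3. Load: the final insert becomes a plain copy from the shaped staging table
INSERT INTO tbl_huge (a, b, c)
SELECT a, b, c FROM stg_combined;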

If any of those steps takes too long, you can look at chunking that individual step.

It's a lot more work, but it's more robust.

Edit: Often when loading data this way, it's based on a date (i.e. move all of yesterday's data into the warehouse). If that's part of the issue here, you could run the load more often, like hourly or even every 5 minutes.
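For example, something like this, assuming the staged rows carry a load timestamp (the LoadDate column and the stg_combined table are assumptions, not from the question):

-- Hypothetical incremental load: only move rows staged since the last run
DECLARE @since DATETIME2 = DATEADD(HOUR, -1, SYSUTCDATETIME());

INSERT INTO tbl_huge (a, b, c)
SELECT a, b, c
FROM stg_combined
WHERE LoadDate >= @since;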

Trubs
  • Even if I land all the data, I'm unsure how to chunk an insert that comes from a bunch of joins. I'll have basically the same issue, won't I? – DiscontentDisciple Aug 28 '19 at 15:09
  • That's a valid point, but tbh we _don't know_ if it will be necessary. It could be that loading the data into the temp table takes all the time and inserting from the temp table will be super quick (although a wide table with 35M rows might be problematic). But by splitting it out into multiple statements and steps, you can isolate where the problems are and tune those parts of the query. – Trubs Aug 29 '19 at 04:10
0

You can follow the steps below to do the insert in batches.

  1. Create a temporary table with a RowNumber identity column to hold the result set:
CREATE TABLE #tempTableToHoldResultSet(RowNumber Int IDENTITY(1,1), OtherColumns...)
  2. Insert the result set into the temporary table created in step 1:
INSERT INTO #tempTableToHoldResultSet(othercolumns...)
SELECT DISTINCT a, b, c, d ... FROM (Mess of Subqueries & #tmp_tbls) 
  3. Now do the batch insert using a WHILE loop. You can use TRY...CATCH transaction management as described in the linked Stack Overflow answer:
DECLARE @minRowNumber INT = 1
DECLARE @batchSize INT = 10000
DECLARE @maxRowNumber INT
SET @maxRowNumber = @minRowNumber + @batchSize

-- Keep going while the current RowNumber window still contains rows to copy
WHILE EXISTS (SELECT * FROM #tempTableToHoldResultSet
              WHERE RowNumber >= @minRowNumber AND RowNumber < @maxRowNumber)
BEGIN
    SET XACT_ABORT, NOCOUNT ON

    DECLARE @starttrancount INT

    BEGIN TRY
        SELECT @starttrancount = @@TRANCOUNT

        -- Only start a transaction if the caller has not already opened one
        IF @starttrancount = 0
            BEGIN TRANSACTION

        -- Copy one batch, bounded by the current RowNumber window
        INSERT INTO tbl_huge
        SELECT a, b, c, d ... FROM #tempTableToHoldResultSet
        WHERE RowNumber >= @minRowNumber AND RowNumber < @maxRowNumber;

        -- Slide the window forward for the next batch
        SET @minRowNumber = @maxRowNumber
        SET @maxRowNumber = @minRowNumber + @batchSize

        IF @starttrancount = 0
            COMMIT TRANSACTION
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 AND @starttrancount = 0
            ROLLBACK TRANSACTION;
        THROW;
        -- Before SQL Server 2012, use RAISERROR instead of THROW to rethrow
        -- the caught error (@ErrorNumber, @ErrorMessage, etc.)
    END CATCH
END
GO
  4. After the data load is completed, you can DROP the temporary table:
-- Drop the temporary table, after data load is completed
DROP TABLE #tempTableToHoldResultSet 
GO
Venkataraman R