
I have a table with a million rows and two columns.

code | name
xyz  | product1
abc  | Product 2
...
...

I want to do the insert in small batches (10000 rows at a time) via an INSERT INTO ... SELECT query.

How can we do this when there is no identity key to batch on?

  • First: why don't you insert it all in one shot? Second: it looks like you have many duplicate "code" values, so you could select the distinct codes into a temp table and then loop over the temp table, copying the records that match each code. This may not be exactly 10000 per batch, though. – FLICKER Mar 03 '16 at 20:29
  • Is the two-column table the source or the destination of the data? – user957902 Mar 03 '16 at 20:33

2 Answers


You could use a LEFT OUTER JOIN in your SELECT statement to identify records that are not already in the target table, then use TOP to grab the first 10000 the database finds. Something like:

INSERT INTO tableA (code, name)
SELECT TOP 10000 tableB.code, tableB.name
FROM tableB
LEFT OUTER JOIN tableA ON tableB.code = tableA.code
WHERE tableA.code IS NULL;

And then run that over and over again until there is nothing left to insert.
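If you want to script the "over and over" part instead of re-running it by hand, here is a minimal sketch (assuming SQL Server; the WHILE/@@ROWCOUNT loop is my addition, while the table and column names come from the example above):

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    INSERT INTO tableA (code, name)
    SELECT TOP 10000 tableB.code, tableB.name
    FROM tableB
    LEFT OUTER JOIN tableA ON tableB.code = tableA.code
    WHERE tableA.code IS NULL;

    -- @@ROWCOUNT falls to 0 once every code in tableB already exists in tableA
    SET @rows = @@ROWCOUNT;
END

Each pass commits at most 10000 rows, so no single transaction has to carry millions of rows in the log, which is the point of batching raised in the comments below.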

You could also use window functions to batch, like:

INSERT INTO tableA (code, name)
SELECT code, name
FROM (
       SELECT code, name, ROW_NUMBER() OVER (ORDER BY name) AS rownum
       FROM tableB
     ) AS numbered
WHERE rownum BETWEEN 1 AND 10000;

And then just keep changing the BETWEEN range to get each batch. Personally, if I had to do this, I would use the first method, since it's guaranteed to catch everything that isn't already in tableA.
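If editing the BETWEEN range by hand gets tedious, the same idea can be parameterised. This is only a sketch: the @start/@batch variables and the WHILE loop are my own illustration, not part of the original answer.

DECLARE @start INT = 1;
DECLARE @batch INT = 10000;
DECLARE @rows  INT = 1;

WHILE @rows > 0
BEGIN
    INSERT INTO tableA (code, name)
    SELECT code, name
    FROM (
           SELECT code, name, ROW_NUMBER() OVER (ORDER BY name) AS rownum
           FROM tableB
         ) AS numbered
    WHERE rownum BETWEEN @start AND @start + @batch - 1;

    SET @rows  = @@ROWCOUNT;   -- 0 once the range has moved past the last numbered row
    SET @start = @start + @batch;
END

One caveat: because the subquery is re-evaluated on every pass, the ORDER BY needs to be deterministic (for example ORDER BY name, code if names can repeat), otherwise batches can overlap or skip rows.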

Also, if there is any possibility that tableB will gain records during this batching process, then option 1 is definitely better. Essentially, option 2 determines the ROW_NUMBER() on the fly, so newly inserted records will cause rows to be missed if they show up in the middle of the batches.

If tableB is static, then option 2 may be faster, since the DB just has to sort and number the records instead of joining a huge table to a huge table and then grabbing 10000 records.

JNevill
  • Batching like this does two very important things. It keeps the log from bloating out of control when there are millions and millions of rows. It also lets SQL Server allow other processes to continue in between executions. The latter is extremely important in a production environment. You don't want to cripple your server for an hour and a half so you can copy some data around. – Sean Lange Mar 03 '16 at 20:27
  • That does make sense if you can't afford the downtime. I've worked in Corporate my entire career, so shoving in millions of records overnight is no biggie. – JNevill Mar 03 '16 at 20:28
  • I believe it will work for a full load too, when there is no data in table B – sam Mar 03 '16 at 20:31
  • Also, I just guessed at the join on the table. Probably you'd want to join across both code and product, or whatever the key is that works on both tables. That's obvious, but I figured it should be mentioned, otherwise you'll think you've fully loaded it, when in fact there are bunch of records missing. – JNevill Mar 03 '16 at 20:32
  • If it's a huge table (30 million rows), isn't it going to be a huge performance hit? – sam Mar 03 '16 at 20:40
  • we did let 'er rip and it overran the log file – sam Mar 03 '16 at 20:42
  • Darned log file. I agree it depends on your system and your uptime needs. If your log file is ballooning and filling your disk, then batching makes sense. Maybe if 10000 goes in lickety-split and your log file is still in check, then consider upping it to 100000 or more until it feels right. Otherwise you'll be submitting 3000 individual SQL statements to copy all of this data. – JNevill Mar 03 '16 at 20:46
  • @JNevill the second solution doesn't work http://stackoverflow.com/questions/13997177/why-no-windowed-functions-in-where-clauses – sam Mar 03 '16 at 22:02
  • Oh geez! Of course :( It would still work, it just needs to be moved up to the SELECT, and then the whole statement needs to be a subquery so a `where` can be applied. I'm used to Teradata, where that thing could be stuck in a `QUALIFY` clause without the subquery. – JNevill Mar 04 '16 at 14:19
  • I've updated the code to reflect that change. Sorry about that :( – JNevill Mar 04 '16 at 14:20
  • but won't the subquery fill up the log, since it will run first and then we will do the inserts in batches? – sam Mar 04 '16 at 16:42
  • I'm not certain. If you ran a basic select query that returned a million records, would that add more to the log than a basic select query that returned 10 records? I'm not familiar enough with SQL Server's logs, but I have to imagine that a `SELECT` statement would cause the same number of log records regardless of its result set size. That just seems like a poor RDBMS design decision otherwise. – JNevill Mar 04 '16 at 16:52
  • I am going to try and will post the results here. Thanks JNevill – sam Mar 04 '16 at 17:12
  • We still have to run this in production, but in the test environment it did seem to work as expected – sam Mar 07 '16 at 20:30
  • @sam, do you remember if this worked (the first proposed left join solution)? – ColinMac Aug 26 '19 at 16:11

You can paginate the SELECT, pulling records in a batch/page size of, say, 10000 (or whatever you need) and inserting them into the target table. In the sample below, you will have to change the values of @Min and @Max for each iteration to get the batch size you want.

INSERT INTO EmployeeNew (Name)
SELECT Name
FROM
(
    SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS [Rank], Name
    FROM Employee
) AS RankedEmployee
WHERE [Rank] >= @Min AND [Rank] < @Max;
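
If changing @Min and @Max by hand for every iteration is impractical, here is a hedged sketch of the same pagination driven by a loop (the @BatchSize variable and the WHILE loop are illustrative additions, not part of the answer above):

DECLARE @Min INT = 1;
DECLARE @BatchSize INT = 10000;
DECLARE @Max INT = @Min + @BatchSize;
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    INSERT INTO EmployeeNew (Name)
    SELECT Name
    FROM
    (
        SELECT DENSE_RANK() OVER (ORDER BY EmployeeId) AS [Rank], Name
        FROM Employee
    ) AS RankedEmployee
    WHERE [Rank] >= @Min AND [Rank] < @Max;

    SET @rows = @@ROWCOUNT;        -- 0 once @Min has moved past the highest rank
    SET @Min  = @Max;
    SET @Max  = @Min + @BatchSize;
END

Note that DENSE_RANK assigns the same rank to duplicate EmployeeId values, so a batch can be slightly larger than @BatchSize if EmployeeId is not unique.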
Dilip Nannaware