
I have a large (5 million row, 300+ column) csv file I need to import into a staging table in SQL Server, then run a script to split each row up and insert data into the relevant tables in a normalized db. The format of the source table looks something like this:

(fName, lName, licenseNumber1, licenseIssuer1, licenseNumber2, licenseIssuer2..., specialtyName1, specialtyState1, specialtyName2, specialtyState2..., identifier1, identifier2...)

There are 50 licenseNumber/licenseIssuer columns, 15 specialtyName/specialtyState columns, and 15 identifier columns. There is always at least one of each of those, but the remaining 49 or 14 could be null. The first identifier is unique, but is not used as the primary key of the Person in our schema.

My database schema looks like this:

People(ID int Identity(1,1))
Names(ID int, personID int, lName varchar, fName varchar)
Licenses(ID int, personID int, number varchar, issuer varchar)
Specialties(ID int, personID int, name varchar, state varchar)
Identifiers(ID int, personID int, value varchar)

The database will already be populated with some People before adding the new ones from the csv.

What is the best way to approach this?

I have tried iterating over the staging table one row at a time with select top 1:

WHILE EXISTS (Select top 1 * from staging)
BEGIN
    INSERT INTO People Default Values
    SET @LastInsertedID = SCOPE_IDENTITY() -- might use the output clause to get this instead

    INSERT INTO Names (personID, lName, fName) 
    SELECT top 1 @LastInsertedID, lName, fName from staging

    INSERT INTO Licenses(personID, number, issuer)
    SELECT top 1 @LastInsertedID, licenseNumber1, licenseIssuer1 from staging

    IF (select top 1 licenseNumber2 from staging) is not null
    BEGIN
        INSERT INTO Licenses(personID, number, issuer)
        SELECT top 1 @LastInsertedID, licenseNumber2, licenseIssuer2 from staging
    END

    -- Repeat the above 49 times, etc...

    DELETE TOP (1) FROM staging
END

One problem with this approach is that it is prohibitively slow, so I refactored it to use a cursor. This works and is significantly faster, but has me declaring 300+ variables for `FETCH INTO`.

Is there a set-based approach that would work here? That would be preferable, as I understand that cursors are frowned upon, but I'm not sure how to get the identity from the INSERT into the People table for use as a foreign key in the others without going row-by-row from the staging table.
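For the identity question specifically, one set-based pattern is `MERGE ... OUTPUT`: unlike `INSERT ... OUTPUT`, the `OUTPUT` clause of a `MERGE` may reference source columns, so it can capture a mapping from each staging row to its newly generated `People.ID`. The sketch below is untested against this schema and assumes `identifier1` is unique (as stated above); the column size is a guess.

    DECLARE @PersonMap TABLE (identifier1 varchar(100), personID int);

    -- ON 1 = 0 never matches, so every staging row takes the INSERT branch,
    -- and OUTPUT can pair the source identifier with the new identity value.
    MERGE INTO People AS tgt
    USING (SELECT identifier1 FROM staging) AS src
        ON 1 = 0
    WHEN NOT MATCHED THEN
        INSERT DEFAULT VALUES
    OUTPUT src.identifier1, inserted.ID
        INTO @PersonMap (identifier1, personID);

    -- The map then drives plain set-based inserts into the child tables:
    INSERT INTO Names (personID, lName, fName)
    SELECT m.personID, s.lName, s.fName
    FROM staging AS s
    JOIN @PersonMap AS m ON m.identifier1 = s.identifier1;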

Also, how could I avoid copy and pasting the insert into the Licenses table? With a cursor approach I could try:

FETCH INTO ...@LicenseNumber1, @LicenseIssuer1, @LicenseNumber2, @LicenseIssuer2...
INSERT INTO #LicenseTemp (number, issuer) Values
(@LicenseNumber1, @LicenseIssuer1),
(@LicenseNumber2, @LicenseIssuer2),
... Repeat 48 more times...
.
.
.
INSERT INTO Licenses(personID, number, issuer)
SELECT @LastInsertedID, number, issuer
FROM #LicenseTemp
WHERE number is not null

There still seems to be some redundant copy and pasting there, though.
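One alternative sketch (not from the question, and it assumes a personID column has already been stamped onto the staging rows): `CROSS APPLY (VALUES ...)` unpivots the repeated columns with one line per pair, and no per-column `IF`/`INSERT` or intermediate temp table.

    INSERT INTO Licenses (personID, number, issuer)
    SELECT s.personID, v.number, v.issuer
    FROM staging AS s
    CROSS APPLY (VALUES
        (s.licenseNumber1, s.licenseIssuer1),
        (s.licenseNumber2, s.licenseIssuer2)
        -- ...one line per pair, through (s.licenseNumber50, s.licenseIssuer50)
    ) AS v(number, issuer)
    WHERE v.number IS NOT NULL;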

To summarize the questions, I'm looking for idiomatic approaches to:

  1. Break up one large staging table into a set of normalized tables, retrieving the Primary Key/identity from one table and using it as the foreign key in the others
  2. Insert multiple rows into the normalized tables that come from many repeated columns in the staging table with less boilerplate/copy and paste (Licenses and Specialties above)

Short of discrete answers, I'd also be very happy with pointers towards resources and references that could assist me in figuring this out.

Mike Nitchie
  • can you modify the csv file? – McNets Nov 06 '16 at 20:41
  • It's probably impractical to modify the csv file (opening it in excel is impossible, and it is ugly to inspect and edit in text editors), but the staging table could be modified, sure. – Mike Nitchie Nov 06 '16 at 20:42
  • Is there anything to provide uniqueness of `fname/lname`? – Shnugo Nov 06 '16 at 20:48
  • I think you could add the People ID directly in your staging table rather than in your procedures, then for Licenses, maybe a while() that builds the insert statement dynamically for all NOT NULL records. – McNets Nov 06 '16 at 20:53
  • Not for `fname/lname` but there is a unique identifier associated with each `Person` (though it is not used as the Primary Key of the Person in our db). I just edited my question to reflect this. – Mike Nitchie Nov 06 '16 at 20:54
  • @mcNets - I considered that, but that would still require an iteration over the staging table, correct? If I'm iterating over it once, it makes sense to do all necessary inserts at that time, rather than do one iteration followed by set operations for inserts into other tables. – Mike Nitchie Nov 06 '16 at 20:55
  • @MikeNitchie, no, you only need to read each record once, then call a procedure that builds Licenses, Specialties, and Identifiers. But 5 million records are a lot. I think it's a good idea to mark every record, just to avoid repeating the whole process. – McNets Nov 06 '16 at 21:04
  • I agree with @mcNets: go for the set operations over cursor and row-by-row logic. You should strive to access the staging table once for every single target (i.e. 5 times). – Insac Nov 06 '16 at 21:06
  • Which version of sqlserver are you working on? – Insac Nov 06 '16 at 21:07
  • @insac sqlserver 2014+ – Mike Nitchie Nov 06 '16 at 21:17
  • Thanks.. another question, just to be sure. You wrote `Names(ID int, personID int, lName varchar, fName varchar)`: what is the first ID column for? Shouldn't personID be sufficient? – Insac Nov 06 '16 at 21:21
  • The first ID is the primary key of the Name. We decided they should have their own ID rather than using a combination of other columns for a primary key – Mike Nitchie Nov 06 '16 at 21:24

2 Answers


Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.

Calculate the personId on the staging table
As @Shnugo suggested before me, calculating the personId in the staging table will ease the next steps.

Use a sequence for the personID
Since SQL Server 2012 you can define sequences. If you use one for every person insert, you'll never risk overlapping IDs. If (as it seems) you have personIds that were loaded before the sequence existed, you can create the sequence with the first free personID as its starting value.
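A sketch of that setup (`PersonSeq` is an assumed name; `START WITH` requires a constant, hence the dynamic SQL; a personId column is assumed to exist on staging):

    DECLARE @start bigint = (SELECT ISNULL(MAX(ID), 0) + 1 FROM People);
    DECLARE @sql nvarchar(200) =
        N'CREATE SEQUENCE dbo.PersonSeq START WITH '
        + CAST(@start AS nvarchar(20)) + N' INCREMENT BY 1;';
    EXEC sp_executesql @sql;

    -- Stamp an ID onto every staging row:
    UPDATE staging SET personId = NEXT VALUE FOR dbo.PersonSeq;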

Create a numbers table
Create a utility table holding numbers from 1 to n (you need n to be at least 50; you can look at this question for some implementations).
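One common implementation that works on SQL Server 2014 (a sketch; the table name is an assumption):

    -- 1..50 is enough here; widen TOP (n) as needed.
    SELECT TOP (50)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    INTO   numbers
    FROM   sys.all_objects;

    CREATE UNIQUE CLUSTERED INDEX IX_numbers ON numbers (n);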

Use set logic to do the insert
I'd avoid cursors and row-by-row logic: you are right that it is better to limit the number of accesses to the table, but I'd say you should strive to limit it to one access per target table.

You could proceed like this:

People:

 INSERT INTO People (personID) 
 SELECT personId from staging;

Names:

 INSERT INTO Names (personID, lName, fName) 
 SELECT personId, lName, fName from staging;

Licenses: here we'll need the Number table

 INSERT INTO Licenses (personId, number, issuer)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then licenseNumber1 
                when 2 then licenseNumber2
                ...
                when 50 then licenseNumber50
            end as licenseNumber,    
           case nbrs.n 
                when 1 then licenseIssuer1 
                when 2 then licenseIssuer2
                ...
                when 50 then licenseIssuer50
            end as licenseIssuer
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=50) nbrs
  ) AS t WHERE licenseNumber is not null;

Specialties:

 INSERT INTO Specialties(personId, name, state)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then specialtyName1
                when 2 then specialtyName2
                ...
                when 15 then specialtyName15
            end as specialtyName,    
           case nbrs.n 
                when 1 then specialtyState1
                when 2 then specialtyState2
                ...
                when 15 then specialtyState15
            end as specialtyState
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) AS t WHERE specialtyName is not null;

Identifiers:

 INSERT INTO Identifiers(personId, value)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then identifier1
                when 2 then identifier2
                ...
                when 15 then identifier15
            end as value
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) AS t WHERE value is not null;

Hope it helps.

Insac
  • Thanks for this, it ended up working great. The one change I made is that I used the `UNION` approach suggested by Shnugo instead of a numbers table – Mike Nitchie Nov 12 '16 at 16:31

You say: but the staging table could be modified

I would

  • add a PersonID INT NOT NULL column and fill it with DENSE_RANK() OVER(ORDER BY fname,lname)

  • add an index to this PersonID

  • use this ID in combination with GROUP BY to fill your People table

  • do the same with your names table

  • And then use this ID for a set-based insert into your three side tables
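A sketch of those bullets (table/column names assumed; the possible collision with existing People IDs is handled in the update at the end of this answer):

    ALTER TABLE StagingTable ADD PersonID int NULL;

    -- Window functions can't sit directly in an UPDATE's SET clause,
    -- but an updatable CTE works:
    WITH ranked AS (
        SELECT PersonID,
               DENSE_RANK() OVER (ORDER BY fname, lname) AS rn
        FROM StagingTable
    )
    UPDATE ranked SET PersonID = rn;

    CREATE INDEX IX_Staging_PersonID ON StagingTable (PersonID);

    -- One People row per distinct ID (IDENTITY_INSERT because ID is an identity):
    SET IDENTITY_INSERT People ON;
    INSERT INTO People (ID)
    SELECT PersonID FROM StagingTable GROUP BY PersonID;
    SET IDENTITY_INSERT People OFF;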

Do it like this:

SELECT AllTogether.PersonID, AllTogether.TheValue
FROM
(
           SELECT PersonID,SomeValue1 AS TheValue FROM StagingTable
 UNION ALL SELECT PersonID,SomeValue2             FROM StagingTable
 UNION ALL ... 
) AS AllTogether
WHERE AllTogether.TheValue IS NOT NULL

UPDATE

You say: might cause a conflict with IDs that already exist in the People table

You did not tell us anything about the existing People...

Is there any sure and unique mark to identify them? Use a simple

UPDATE StagingTable SET PersonID=xyz WHERE ...

to set existing PersonIDs into your staging table and then use something like

UPDATE StagingTable 
SET PersonID=DENSE_RANK() OVER(...) + MaxExistingID
WHERE PersonID IS NULL

to set new IDs for PersonIDs still being NULL.

Shnugo
  • Good suggestion. However, `DENSE_RANK()` might cause a conflict with IDs that already exist in the `People` table. Is there a way to add PersonIDs to the staging table that we know don't already exist in the `People` table? – Mike Nitchie Nov 06 '16 at 21:00