
I'm currently playing with different ways of getting data into SQL Server, and yesterday I hit a problem using BCP which, although I solved it, reminded me of working with SSIS packages because of the unhelpful error information. For the way I like to work, I feel I would be much happier loading entire data rows, whether fixed-width or delimited, into a staging table (using BCP or BULK INSERT) and then operating on the rows there, rather than trying to force them into typed columns on the way into SQL.

As such, I would like to find an approach that allows me to split and validate (datatype-check) the data before inserting it into its destination, and also write out any bad data rows to another table so I can then decide what to do with them.

I've knocked together a script to simulate the scenario; the ImportedData table represents the output of my BCP or BULK INSERT. All the data from ImportedData needs to end up in either the Presenters or the RejectedData table.

I need an approach that could scale reasonably well. A real-life situation might be more like 40 columns across with 20 million rows of data, so I'm thinking I'll have to do something like process 10,000 rows at a time.

SQL Server 2012 has the new TRY_PARSE function, which would probably help, but I need to be able to do this on 2005 and 2008 machines.

IF OBJECT_ID (N'ImportedData', N'U') IS NOT NULL DROP TABLE dbo.ImportedData
CREATE TABLE dbo.ImportedData (RowID INT IDENTITY(1,1), DataRow VARCHAR(MAX))

IF OBJECT_ID (N'Presenters', N'U') IS NOT NULL DROP TABLE dbo.Presenters
CREATE TABLE dbo.Presenters (PresenterID INT, FirstName VARCHAR(10), LastName VARCHAR(10))

IF OBJECT_ID (N'RejectedData', N'U') IS NOT NULL DROP TABLE dbo.RejectedData
CREATE TABLE dbo.RejectedData (DataRow VARCHAR(MAX))

-- insert as fixed-width
INSERT INTO dbo.ImportedData(DataRow)
SELECT           '1  Bruce     Forsythe  '                               
UNION ALL SELECT '2  David     Dickinson '
UNION ALL SELECT 'X  BAD       DATA'                                  
UNION ALL SELECT '3  Keith     Chegwin   '                        

-- insert as CSV
/*INSERT INTO dbo.ImportedData(DataRow)
SELECT '1,Bruce,Forsythe'                               
UNION ALL SELECT '2,David,Dickinson'                                  
UNION ALL SELECT 'X,BAD,DATA'
UNION ALL SELECT '3,Keith,Chegwin' 
*/
---------- DATA PROCESSING -------------------------------

SELECT
    SUBSTRING(DataRow,1,3) AS ID,
    SUBSTRING(DataRow,4,10) AS FirstName,
    SUBSTRING(DataRow,14,10) AS LastName
FROM
    ImportedData


---------- DATA PROCESSING -------------------------------
SELECT * FROM ImportedData
SELECT * FROM Presenters
SELECT * FROM RejectedData
– samaspin

2 Answers


For your 20M row scenario and concerns over performance, let's dive into that.

Step 1, load the big file into the database. The file system is going to go to disk and read all that data. Maybe you're sitting on banks of Fusion-io drives and IOPS are not a concern, but barring that unlikely scenario you will spend X amount of time reading that data off disk via bcp/BULK INSERT/SSIS/.NET/etc. You then get to spend time writing all of that same data back to disk in the form of the table insert(s).
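For reference, a minimal whole-line load into the staging table might look something like this; the file path is hypothetical, and the view exists only so BULK INSERT can skip the IDENTITY column without needing a format file:

-- A view over just DataRow lets BULK INSERT ignore the RowID IDENTITY column.
CREATE VIEW dbo.ImportedData_Load AS
    SELECT DataRow FROM dbo.ImportedData
GO

BULK INSERT dbo.ImportedData_Load
FROM 'C:\data\presenters.txt'      -- hypothetical path to the source file
WITH (
    ROWTERMINATOR   = '\n',        -- one staging row per line
    FIELDTERMINATOR = '\0',        -- a terminator that never appears, so the whole line lands in DataRow
    TABLOCK                        -- allows minimal logging where the recovery model permits
)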

Step 2, parse that data. Before we spend any CPU time running those substring operations, we'll need to identify the data rows. If your machine is well provisioned on RAM, then the data pages for ImportedData might be in memory and it will be far less costly to access them. Odds are though, they aren't all in memory so a combination of logical and physical reads will occur to get that data. You've now effectively read that source file in twice for no gain.

Now it's time to start splitting your data. Out of the box, T-SQL gives you the trim, left, right, and substring functions. With CLR, you can get some wrapper methods around the .NET string library to help simplify the coding effort, but you'll trade some coding efficiency for instantiation costs. Last I read on the matter, the answer (T-SQL vs. CLR) was "it depends." Shocking if you know the community, but it really depends on your string lengths and a host of other factors.

Finally, we're ready to parse the values and see whether each one is legit. As you say, with SQL 2012 we have TRY_PARSE as well as TRY_CONVERT. PARSE is completely new, but if you need to deal with locale-aware data (01-02-05: in GB it's Feb 1, 2005; in the US it's Jan 2, 2005; in JP it's Feb 5, 2001), it's invaluable. If you're not on 2012, you could roll your own versions with a CLR wrapper.
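To make that concrete, this is roughly what the 2012 behaviour looks like; on 2005/2008, a guarded CASE around ISDATE/ISNUMERIC (or a CLR wrapper) is the usual stand-in:

-- SQL Server 2012+: bad input yields NULL instead of raising an error
SELECT TRY_CONVERT(date, '2005-02-30')              -- NULL, no exception
SELECT TRY_PARSE('01-02-05' AS date USING 'en-GB')  -- 2005-02-01
SELECT TRY_PARSE('01-02-05' AS date USING 'en-US')  -- 2005-01-02

-- 2005/2008 stand-in: guard the conversion so bad values become NULL
-- (ISDATE follows the session's DATEFORMAT/language setting)
SELECT CASE WHEN ISDATE('01-02-05') = 1
            THEN CONVERT(datetime, '01-02-05')
       END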

Step 3, errors! Someone slipped in a bad date or whatever and your cast fails. What happens? Since the query either succeeds or it doesn't, all of the rows fail and you get ever-so-helpful error messages like "string or binary data would be truncated" or "conversion failed when converting datetime from character string." Which row out of your N-sized slice? You won't know until you go looking for it, and this is when folks usually devolve into an RBAR approach, further degrading performance. Or they try to stay set-based but run repeated queries against the source set, filtering out the scenarios that would fail the conversion before attempting the insert, as sketched below.
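A rough sketch of that set-based pre-filter, using the question's sample tables and trusting a simple digit pattern on the ID column rather than a full type check:

-- Rows whose ID portion isn't digits plus padding go to the reject table...
INSERT INTO dbo.RejectedData (DataRow)
SELECT DataRow
FROM   dbo.ImportedData
WHERE  SUBSTRING(DataRow, 1, 3) LIKE '%[^0-9 ]%'
    OR SUBSTRING(DataRow, 1, 3) NOT LIKE '%[0-9]%'

-- ...and everything else is safe to convert and insert.
INSERT INTO dbo.Presenters (PresenterID, FirstName, LastName)
SELECT CONVERT(int, SUBSTRING(DataRow, 1, 3)),
       RTRIM(SUBSTRING(DataRow, 4, 10)),
       RTRIM(SUBSTRING(DataRow, 14, 10))
FROM   dbo.ImportedData
WHERE  SUBSTRING(DataRow, 1, 3) NOT LIKE '%[^0-9 ]%'
  AND  SUBSTRING(DataRow, 1, 3) LIKE '%[0-9]%'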

You can see by my tags that I'm an SSIS kind of guy, but I'm not such a bigot as to think it's the only thing that can work. If there's an approach to ETL, I like to think I've tried it. In your case, I think you will get much better performance and scalability by developing your own ETL code/framework or using an existing one (Rhino or Reactive).

Finally, finally: be aware of the implications of VARCHAR(MAX). It has a performance cost associated with it.

Also, as described, your process would only allow a single instance of the ETL to run at a time. Maybe that covers your use case, but in companies where we did lots of ETL, we couldn't force client B to wait for client A's ETL to finish before starting their work, or we'd be short of clients in no time at all.

– billinkc
  • thanks for this Bill, lots of very useful information in there, I need some time to fully digest it before I come back with some questions! – samaspin Aug 29 '12 at 11:17

There is no simple way of doing it in T-SQL. In this case you need ISDATE()/ISNUMERIC()-style checks (or your own UDFs) for every datatype you are trying to parse. You can then move the rejected rows to the RejectedData table, delete those rows from ImportedData, and continue with your load. (The example below assumes the staged rows also carry a DOB and an Amount field, and that Presenters has matching columns.)

SELECT
    RowID,
    SUBSTRING(DataRow,1,3)   AS ID,
    SUBSTRING(DataRow,4,10)  AS FirstName,
    SUBSTRING(DataRow,14,10) AS LastName,
    SUBSTRING(DataRow,24,8)  AS DOB,
    SUBSTRING(DataRow,32,10) AS Amount
INTO RejectedData  -- SELECT ... INTO creates the table, so drop the sample RejectedData first (or use INSERT ... SELECT)
FROM ImportedData
WHERE ISDATE(SUBSTRING(DataRow,24,8)) = 0
   OR ISNUMERIC(SUBSTRING(DataRow,32,10)) = 0

Then delete those rows from ImportedData:

DELETE FROM ImportedData WHERE RowID IN (SELECT RowID FROM RejectedData)

and then insert into Presenters:

INSERT INTO Presenters
SELECT
    SUBSTRING(DataRow,1,3)   AS ID,
    SUBSTRING(DataRow,4,10)  AS FirstName,
    SUBSTRING(DataRow,14,10) AS LastName,
    CONVERT(DATE, SUBSTRING(DataRow,24,8))           AS DOB,  -- DATE needs 2008+; use DATETIME on 2005
    CONVERT(DECIMAL(18,2), SUBSTRING(DataRow,32,10)) AS Amount
FROM ImportedData

And for managing batched inserts, this is a very good article:

http://sqlserverplanet.com/data-warehouse/transferring-large-amounts-of-data-using-batch-inserts
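For slicing the work into the roughly 10,000-row batches mentioned in the question, the RowID identity on the staging table gives a simple range to loop over. A sketch, assuming RejectedData has been created up front with the columns selected above (rather than via SELECT ... INTO), and with the variable assignment split out so it also runs on 2005:

DECLARE @BatchSize INT, @FromID INT, @MaxRowID INT

SELECT @BatchSize = 10000,
       @FromID    = 1,
       @MaxRowID  = MAX(RowID)
FROM   dbo.ImportedData

WHILE @FromID <= @MaxRowID
BEGIN
    -- Each pass only touches RowID values in [@FromID, @FromID + @BatchSize),
    -- keeping lock duration and log growth bounded.
    INSERT INTO dbo.RejectedData (RowID, ID, FirstName, LastName, DOB, Amount)
    SELECT RowID,
           SUBSTRING(DataRow,1,3),
           SUBSTRING(DataRow,4,10),
           SUBSTRING(DataRow,14,10),
           SUBSTRING(DataRow,24,8),
           SUBSTRING(DataRow,32,10)
    FROM   dbo.ImportedData
    WHERE  RowID >= @FromID
      AND  RowID <  @FromID + @BatchSize
      AND  (   ISDATE(SUBSTRING(DataRow,24,8)) = 0
            OR ISNUMERIC(SUBSTRING(DataRow,32,10)) = 0)

    -- The DELETE from ImportedData and the INSERT into Presenters above go here,
    -- with the same RowID range predicate added.

    SET @FromID = @FromID + @BatchSize
END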

– ClearLogic