
I get a file from a particular vendor every week. Once in a while their process changes, and I wind up getting extra fields, sometimes in the middle of the row. Fortunately, I always get a proper header, and they don't appear to remove the existing fields. Unfortunately, nobody ever knows WHY or how it changed. After changing it several times, I decided to get clever.

So, what I'd like to do: query SQL Server, get the columns from the target view/table, then SELECT just the fields that exist in both, into the table.

My current code (below) doesn't do that - it just does an import-csv, "select the fields I know are good", then insert into a table. Works, yes, but slow as crap. And dumb, since I have to hardcode the fields I know are good. I'd like a more extensible process.
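Here's roughly the direction I'm picturing for the "only the fields that exist in both" part - an untested sketch that leans on the same Invoke-Sqlcmd2 helper I dot-source below:

# Untested sketch: compare the file's header row to the target table's columns,
# then select only the fields that exist in both.
$colQuery  = "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'my_target_table';"
$tableCols = (Invoke-Sqlcmd2 -ServerInstance my_server -Database my_database -Query $colQuery).COLUMN_NAME

$fileCols   = (Get-Content "c:\test\my_import_file.txt" -TotalCount 1) -split '\|'
$commonCols = $fileCols | Where-Object { $tableCols -contains $_ }

Import-Csv "c:\test\my_import_file.txt" -Delimiter "|" |
    Select-Object $commonCols    # the "That First Field" -> FirstField renames still need handling (maybe via the view idea below)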

Additionally, I'd like to stream it in - my existing process crawls (15 minutes for a 50mb file), probably because it has to load the entire file, run it through the select & convert, then do the insert. But it works.

I played with "High-Performance Techniques for Importing CSV to SQL Server using PowerShell", but kept running into "I don't know PowerShell well enough to fix this" issues.

Here's what I have so far. Ultra-simplified, though I could make it simpler by building a view on top of it that does the conversion from "That First Field" to "FirstField". Thanks in advance!

In SQL Server:

CREATE DATABASE my_database
go
USE my_database
go
CREATE TABLE my_target_table
(FirstField VARCHAR(100), StoreId INT,State VARCHAR(3))

Now the powershell script:

#import all the stuff we need. These are Chad Miller's datatable scripts:
#http://gallery.technet.microsoft.com/ScriptCenter/7985b7ef-ed89-4dfd-b02a-433cc4e30894/
#http://gallery.technet.microsoft.com/ScriptCenter/2fdeaf8d-b164-411c-9483-99413d6053ae/
. C:\scripts\out-datatable.ps1;
. C:\scripts\invoke-sqlcmd2.ps1;
. C:\scripts\write-datatable.ps1;

if (Test-Path Variable:\my_import) {
    Remove-Variable my_import
}

$ImportFilePath = "c:\test\my_import_file.txt"

#save the below 3 lines as "c:\test\my_import_file.txt"
#That First Field|Store Id|UselessField|State
#1|15|AAA|MI
#12|4|AB|RI

$my_import = Import-Csv $ImportFilePath -Delimiter "|" |
    Select-Object @{Label="FirstField";Expression={$_."That First Field"}},
                  @{Label="StoreId";Expression={$_."Store Id"}},
                  @{Label="State";Expression={$_."State"}} |
    Out-DataTable

Write-DataTable -ServerInstance my_server -Database my_database -TableName my_target_table -Data $my_import
  • Because this code works, it would be a better fit for [CodeReview.SE](http://codereview.stackexchange.com/). – briantist Jun 01 '16 at 21:52
  • Yes, but what I'm looking for is, in all likelihood, an entirely different approach. What I'm doing here won't work for what I want to do. – mbourgon Jun 02 '16 at 12:18

1 Answer


I can't speak to the PowerShell side as I don't know enough about it, but I can say that using a DataTable is not terribly efficient, because it has to load the entire dataset into memory before releasing anything to SQL Server. That alone is not truly scalable.

Given that you want to:

  1. Select a subset of the fields in the CSV file
  2. Stream the data in
  3. Be as dynamic / flexible as possible
  4. Import 50 MB in far less than 15 minutes

you should be able to accomplish all of this using Table-Valued Parameters and .NET.

Step 1 of getting a TVP to work is to create a User-Defined Table-Type (UDTT). This is the strongly-typed structure used to transport the data from .NET (and maybe PowerShell as well) into SQL Server. It provides the definition of what will be a table variable in a stored procedure or ad hoc query that receives the data. UDTTs can even have IDENTITY columns, CHECK Constraints, UNIQUE Constraints, etc.
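For example (just a sketch, and dbo.ImportOfMyTargetTable is only an illustrative name), a UDTT matching my_target_table from the question could be created like this, reusing the Invoke-Sqlcmd2 helper the question already dot-sources:

# Sketch: a UDTT mirroring my_target_table (the type name is illustrative)
$createType = @"
CREATE TYPE dbo.ImportOfMyTargetTable AS TABLE
(FirstField VARCHAR(100), StoreId INT, State VARCHAR(3));
"@
Invoke-Sqlcmd2 -ServerInstance my_server -Database my_database -Query $createType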

Step 2 is writing code in .NET where the key component is creating a method that returns IEnumerable<SqlDataRecord>. This method is passed in as the value of the SqlCommand parameter that maps to the User-Defined Table-Type, rather than passing in a DataTable for the UDTT. While a DataTable streams the data out to SQL Server just as this method would, the custom method also allows for streaming the data into the app. This means that for each row you read, you can send it immediately, and then read another row. So the memory footprint is just a single row.
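As a rough illustration of the wiring, the Structured-parameter side looks something like the sketch below, using the question's my_target_table and the hypothetical dbo.ImportOfMyTargetTable type from Step 1. Treat the PowerShell specifics as approximate (PowerShell isn't my area), and note that this sketch collects the rows into a list, so it only shows the TVP mechanics; a truly lazy IEnumerable<SqlDataRecord> iterator is what gives you the single-row memory footprint (the C# in the answer linked at the bottom shows that part).

# Column names/types mirror the hypothetical dbo.ImportOfMyTargetTable UDTT from Step 1
$meta = [Microsoft.SqlServer.Server.SqlMetaData[]]@(
    [Microsoft.SqlServer.Server.SqlMetaData]::new("FirstField", [System.Data.SqlDbType]::VarChar, 100),
    [Microsoft.SqlServer.Server.SqlMetaData]::new("StoreId",    [System.Data.SqlDbType]::Int),
    [Microsoft.SqlServer.Server.SqlMetaData]::new("State",      [System.Data.SqlDbType]::VarChar, 3)
)

# Build one SqlDataRecord per CSV row. A List<SqlDataRecord> satisfies
# IEnumerable<SqlDataRecord>, but it is NOT lazy -- row-at-a-time streaming needs
# an iterator (in PowerShell that usually means a small C# helper via Add-Type).
$rows = [System.Collections.Generic.List[Microsoft.SqlServer.Server.SqlDataRecord]]::new()
Import-Csv "c:\test\my_import_file.txt" -Delimiter "|" | ForEach-Object {
    $rec = [Microsoft.SqlServer.Server.SqlDataRecord]::new($meta)
    $rec.SetString(0, $_."That First Field")
    $rec.SetInt32( 1, [int]$_."Store Id")
    $rec.SetString(2, $_.State)
    $rows.Add($rec)
}

# Adjust the connection string / auth to your environment
$conn = New-Object System.Data.SqlClient.SqlConnection("Server=my_server;Database=my_database;Integrated Security=SSPI;")
$cmd  = $conn.CreateCommand()
$cmd.CommandText = "INSERT INTO dbo.my_target_table (FirstField, StoreId, State) SELECT FirstField, StoreId, State FROM @tvp;"
$tvp = $cmd.Parameters.Add("@tvp", [System.Data.SqlDbType]::Structured)
$tvp.TypeName = "dbo.ImportOfMyTargetTable"
$tvp.Value    = $rows    # SQL Server rejects an empty enumeration, so skip the call if $rows.Count -eq 0

$conn.Open()
[void]$cmd.ExecuteNonQuery()
$conn.Close()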

The other nice thing about this approach is that since the code that reads the file is custom, you can implement any logic you want regarding which fields to grab from the CSV file. You can connect to SQL Server, get the columns in a particular table, and map just those. The down-side, however, is that the TVP (i.e. the User-Defined Table-Type / UDTT) is not dynamic and not easily changeable. Which should be ok since I doubt that the table definitions are changing that often anyway. So you really are mapping to each particular table, more so than mapping to the columns in the table. You just need a naming convention for the UDTTs that makes it easy to correlate them to their associated target table. But then you still have the ability to query the DB for the definition of the UDTT so that you can dynamically build the SqlMetaDataCollection that provides the structure of the SqlDataRecord that is the .NET mapping to the UDTT.
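A rough starting point for that lookup (again just a sketch: the type name is the illustrative one from Step 1, and I'm reusing the question's Invoke-Sqlcmd2 helper) would be to read the UDTT's column list from the catalog views and build the SqlMetaData entries from it:

# Sketch: read the UDTT's columns so the SqlMetaData[] can be built dynamically
$udttQuery = @"
SELECT c.name, t.name AS type_name, c.max_length
FROM   sys.table_types tt
       INNER JOIN sys.columns c ON c.object_id = tt.type_table_object_id
       INNER JOIN sys.types   t ON t.user_type_id = c.user_type_id
WHERE  tt.name = 'ImportOfMyTargetTable'
ORDER BY c.column_id;
"@
$udttColumns = Invoke-Sqlcmd2 -ServerInstance my_server -Database my_database -Query $udttQuery

# ...then map each row to a SqlMetaData entry (VarChar needs max_length, Int does not),
# and intersect $udttColumns.name with the CSV header to decide which fields to read.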

I have example code of most of this (all but the dynamic fetching of the SqlDataRecord definition) in the following answer, also here on S.O.:

How can I insert 10 million records in the shortest time possible?

– Solomon Rutzky