
I'm currently trying to import a 20 GB CSV file (roughly 64 million rows, 58 columns) into an MSSQL database.

First I tried to do it with SSIS, but it was so slow that I decided to try PowerShell instead, and found a nice script here:

High performance import of csv

The script is very quick; I manage to insert roughly 1 million rows per minute. However, I need to be able to handle delimiters embedded in quotes, like this: Column1,"Car,plane,boat",Column3

I did it using regex, per the author's recommendation, by switching:

 $null = $datatable.Rows.Add($line.Split($csvdelimiter))

to:

 $null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
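
One variant suggested in the comments (not benchmarked here) is to compile the pattern once with RegexOptions.Compiled and reuse a single regex instance, instead of calling the static [regex]::Split for every line. A minimal sketch, using the same pattern and variables as the full script below:

    # Sketch only: build the pattern once with the Compiled option and reuse the instance.
    # ExplicitCapture keeps the ($csvdelimiter) group out of the values returned by Split.
    $csvSplit = "($csvdelimiter)" + '(?=(?:[^"]|"[^"]*")*$)'
    $regexOptions = [System.Text.RegularExpressions.RegexOptions]'ExplicitCapture, Compiled'
    $splitRegex = New-Object System.Text.RegularExpressions.Regex($csvSplit, $regexOptions)

    # Inside the read loop the instance method replaces the static call:
    $null = $datatable.Rows.Add($splitRegex.Split($line))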

Full script:

# Database variables
$sqlserver = "server"
$database = "database"
$table = "tablename"

# CSV variables
$csvfile = "filepath"
$csvdelimiter = ","
$firstRowColumnNames = $true
$fieldsEnclosedInQuotes = $true

# Handling of regex for comma problem
if ($fieldsEnclosedInQuotes) {
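    # The lookahead below matches a delimiter only when the rest of the line contains
    # balanced quotes, i.e. only delimiters outside a quoted field are split on.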
    $csvSplit = "($csvdelimiter)"
    $csvSplit += '(?=(?:[^"]|"[^"]*")*$)'
} else { $csvSplit = $csvdelimiter }



$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
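# ExplicitCapture treats the unnamed ($csvdelimiter) group as non-capturing,
# so Split does not also return the delimiter itself as a field.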


################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew() 
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")

# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000

# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
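# TableLock takes a bulk update lock on the destination table for the duration of the load,
# which generally gives the best throughput for a single large import.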
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize

# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable

# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$firstline = (Get-Content $csvfile -First 1)
$columns = [regex]::Split($firstline, $csvSplit, $regexOptions)


if ($firstRowColumnNames -eq $true) { $null = $reader.readLine() }

foreach ($column in $columns) { 

    $null = $datatable.Columns.Add()
}

# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null)  {
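    # Hot path: this split runs once per row; the regex version of it accounts for
    # the slowdown reported below.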
    $null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
    $i++; if (($i % $batchsize) -eq 0) { 
        $bulkcopy.WriteToServer($datatable) 
        Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear() 
    } 
} 

# Add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

# Clean Up
$reader.Close(); $reader.Dispose()
$bulkcopy.Close(); $bulkcopy.Dispose()
$datatable.Dispose()

Write-Host "Script complete. $i rows have been inserted into the database."
Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
pause

It takes much longer with the regex:

24 seconds per 50,000 rows (with handling of delimiters embedded in quotes)

2 seconds per 50,000 rows (without the handling)

Am I doing something wrong? Is regex the correct way to do this? Can I improve the performance in any way, or is this performance loss something I have to accept?

Update: Added the full script

holder
    *"However , I need to be able to handle delimiters embedded in quotes, like this: "Car,plane,boat""* What do you mean? If the delimiter is inside quotes, it isn't a delimiter. Also, you said that SSIS was slow. How did you try loading it? What version of SQL Server, have you considered using `BULK INSERT` if you're on SQL Server 2017+? – Thom A Mar 31 '20 at 09:06
  • if SSIS was slow, I wonder if your SSIS was ... wrong; SSIS is usually pretty good for import; frankly this sounds like a job for BCP at the command line (or `BULK INSERT` in SSMS); alternatively, `FastCsvReader` couples nicely with `SqlBulkCopy` for a bulk import API with full CSV support from .NET – Marc Gravell Mar 31 '20 at 09:06
  • Exactly. Maybe I was a bit unclear. I mean that "Car,plane,boat" should not be split up, but treated as one column and not three. – holder Mar 31 '20 at 09:07
  • @MarcGravell Thanks for your input, yeah maybe you're right. I do use SqlBulkCopy to insert the data to the db. It's when I add the handling of the delimiters embedded in quotes that the performance drops significantly – holder Mar 31 '20 at 09:10
  • SSIS supports quoted strings, and like @MarcGravell said, it's actually pretty fast with CSVs as it can process them as a Bulk operation. Also SQL Server supports quoted strings with the `BULK INSERT` operator from SQL Server 2017 as well (it has the added `FORMAT` and `FIELDQUOTE` options; a sketch of that route follows these comments). – Thom A Mar 31 '20 at 09:17
  • @Larnu I'm afraid we use SQL Server 2016. Does that eliminate the possibility of using the BULK INSERT option? – holder Mar 31 '20 at 09:34
  • Don't underestimate the power of the PowerShell pipeline. For measuring the performance against the PowerShell pipeline capabilities, you can't just take a part of the solution as [**the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts**](https://stackoverflow.com/a/59437162/1701026). To correctly stream each CSV entry to your SQL server, you should **(open the connection) `ConvertFrom-CSV .\data.csv | Foreach-Object {}` (close connection)**. – iRon Mar 31 '20 at 11:46
  • Meaning, instead of first loading the whole CSV into memory (and losing time), immediately start uploading each record. Besides the fact that this is also much easier (and note that e.g. quotes are interpreted while data is being uploaded), it also consumes far less memory. – iRon Mar 31 '20 at 12:28
  • @iRon Thanks. The script loads 50,000 rows into the data table (in memory), then bulk inserts them, clears the table and continues until there are no more rows. Maybe my question is a bit unclear: it's not the insert that is slow, it is the regex split (which I use to handle values that have the delimiter in them). My question is whether there is any faster way to handle this? First I used this (FAST!): $null = $datatable.Rows.Add($line.Split($csvdelimiter)) Now I use this (SLOW): $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions))) – holder Mar 31 '20 at 12:34
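
For reference, a minimal sketch of the BULK INSERT route discussed in the comments above. It assumes SQL Server 2017+ (so it would not apply to the 2016 instance mentioned above), a destination table that already matches the file layout, and placeholder file/table names; it reuses the $sqlserver and $database variables from the script:

    # Sketch only: server-side CSV import via BULK INSERT (SQL Server 2017+).
    # FORMAT = 'CSV' and FIELDQUOTE make the server handle delimiters embedded in quotes.
    # Note: the file path must be readable by the SQL Server service account.
    $sql = "BULK INSERT dbo.tablename FROM 'C:\data\file.csv' " +
           "WITH (FORMAT = 'CSV', FIELDQUOTE = '`"', FIRSTROW = 2, TABLOCK);"

    $connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
    $conn = New-Object System.Data.SqlClient.SqlConnection($connectionstring)
    $conn.Open()
    $cmd = New-Object System.Data.SqlClient.SqlCommand($sql, $conn)
    $cmd.CommandTimeout = 0   # the load can run for a long time
    $null = $cmd.ExecuteNonQuery()
    $conn.Close()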

2 Answers


For large CSVs I would use Microsoft.VisualBasic.FileIO.TextFieldParser. All the parsing (including the advanced cases, see the example) is handled there efficiently.

Do not worry about "VisualBasic"; it's part of .NET. The assembly just has to be added explicitly, and that's it.

Here is a how-to sample with some comments:

    # temp data
    Set-Content z.csv @'
column1,column2,column3
"data,
""1a""",data2a,data3a


data1b,  data2b  ,data3b
'@

    Add-Type -AssemblyName Microsoft.VisualBasic

    $reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $PSScriptRoot\z.csv #! full path
    $reader.SetDelimiters(',') # default is none
    $reader.TrimWhiteSpace = $false # default is true

    while(!$reader.EndOfData) {
        $reader.LineNumber #! it counts not empty lines
        $reader.ReadFields() | %{ "data: '$_'" }
    }
    $reader.Close()

    Remove-Item z.csv
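
For the question's scenario, one way this could slot into the existing bulk-copy loop (a sketch reusing the $csvfile, $csvdelimiter, $firstRowColumnNames, $datatable, $bulkcopy and $batchsize variables already set up there; untested against the full 20 GB file):

    Add-Type -AssemblyName Microsoft.VisualBasic

    # TextFieldParser handles quoted fields (including embedded delimiters) natively,
    # so no regex split is needed per row.
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $csvfile
    $parser.SetDelimiters($csvdelimiter)

    if ($firstRowColumnNames) { $null = $parser.ReadFields() }  # skip the header row

    $i = 0
    while (!$parser.EndOfData) {
        $null = $datatable.Rows.Add($parser.ReadFields())
        $i++
        if (($i % $batchsize) -eq 0) {
            $bulkcopy.WriteToServer($datatable)
            $datatable.Clear()
        }
    }
    $parser.Close()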
Roman Kuzmin
  • Thanks for your reply. Will "Car,plane,boat" be treated as one value or three using your script? I cannot see anything handling the quotes – holder Mar 31 '20 at 10:15
  • Yes, it will be one value. The second record in my example contains: comma, new line, doubled (escaped) quotes. They are all handled. See "data,...1a...". – Roman Kuzmin Mar 31 '20 at 10:20
  • Ok! Thanks. Do you think this will be faster than the System.IO.StreamReader that I use today? – holder Mar 31 '20 at 11:32
  • My estimate is that the "reading lines" part will be the same but the whole "parsing work" will be very much faster. – Roman Kuzmin Mar 31 '20 at 11:44
  • You also need to add [HasFieldsEnclosedInQuotes](https://learn.microsoft.com/en-us/dotnet/api/microsoft.visualbasic.fileio.textfieldparser.hasfieldsenclosedinquotes?view=netframework-4.8). Overall, CSV parsing with all the bells and whistles is quite slow. The OP's current solution can be improved 2x by using a [compiled regex](https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices) – Alex Mar 31 '20 at 13:12
  • The example has fields with quotes and works as expected without changing this option. Apparently the default is what we need. – Roman Kuzmin Mar 31 '20 at 13:31

I've recently implemented a CSV parser for .NET that I claim is the fastest available as a NuGet package: Sylvan.Data.Csv.

This can be used in conjunction with SqlBulkCopy very easily:

using TextReader csvText = GetData(); // Gets a TextReader for a CSV dataset

var opts = new CsvDataReaderOptions {
 // indicates that empty fields should be processed as null
 // instead of as an empty string.
 Schema = CsvSchema.Nullable
};
DbDataReader dataReader = CsvDataReader.Create(csvText, opts);

var csb = 
    new SqlConnectionStringBuilder { 
        DataSource = @"(LocalDb)\MSSqlLocalDb", 
        InitialCatalog = "Test", 
        IntegratedSecurity = true 
    };
using var conn = new SqlConnection(csb.ConnectionString);
conn.Open();


var bcp = new SqlBulkCopy(conn);
bcp.BulkCopyTimeout = 0;
bcp.DestinationTableName = "Feature";
bcp.BatchSize = 50000;
bcp.WriteToServer(dataReader);

This code was able to import a 2.2 million record data set in ~12 seconds on my machine. Using the above code, all columns will be interpreted as nullable strings.

It is also possible to provide a schema to the CsvDataReaderOptions. Assuming the target table in SQL Server has the correct schema applied, this can also be done relatively easily:


SqlConnection conn = ...;
var cmd = conn.CreateCommand();

// select an empty record set to get the table schema.
cmd.CommandText = "select top 0 * from [MyTargetTable]";
var reader = cmd.ExecuteReader();
var tableSchema = reader.GetColumnSchema();
reader.Close();
var csvSchema = new Schema(tableSchema);

var options = new CsvDataReaderOptions { Schema = csvSchema };
// the dataReader will now expose a strongly-typed schema which SqlBulkCopy
// will be able to properly consume.
var dataReader = CsvDataReader.Create(csvText, options);


MarkPflug