I'm currently trying to import a 20 GB CSV file (roughly 64 million rows, 58 columns) into an MSSQL database.
First I tried to do it with SSIS, but it was so slow that I decided to try PowerShell instead and found a nice script here:
High performance import of csv
The script is very quick; I manage to insert roughly 1 million rows per minute. However, I need to be able to handle delimiters embedded in quotes, like this: Column1,"Car,plane,boat",Column3
I did this with a regex, following the author's recommendation, by switching:
$null = $datatable.Rows.Add($line.Split($csvdelimiter))
to:
$null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
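For reference, this is what the quote-aware split produces for the example row above when run on its own (a standalone snippet using just the regex pieces from the script below):
$csvdelimiter = ","
$csvSplit     = "($csvdelimiter)" + '(?=(?:[^"]|"[^"]*")*$)'
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
# Splits only on the commas outside the quotes, so the quoted field keeps its embedded commas
[regex]::Split('Column1,"Car,plane,boat",Column3', $csvSplit, $regexOptions)
# Output: Column1
#         "Car,plane,boat"
#         Column3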
Full script:
# Database variables
$sqlserver = "server"
$database = "database"
$table = "tablename"
# CSV variables
$csvfile = "filepath"
$csvdelimiter = ","
$firstRowColumnNames = $true
$fieldsEnclosedInQuotes = $true
# Handling of regex for the embedded-comma problem: only split on delimiters that sit
# outside quoted fields (i.e. an even number of quotes remains after the delimiter)
if ($fieldsEnclosedInQuotes) {
    $csvSplit = "($csvdelimiter)"
    $csvSplit += '(?=(?:[^"]|"[^"]*")*$)'
} else {
    $csvSplit = $csvdelimiter
}
# ExplicitCapture keeps the unnamed delimiter group out of the Split() results
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000
# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize
# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable
# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$firstline = (Get-Content $csvfile -First 1)
$columns = [regex]::Split($firstline, $csvSplit, $regexOptions)
if ($firstRowColumnNames -eq $true) { $null = $reader.readLine() }
foreach ($column in $columns) {
    $null = $datatable.Columns.Add()
}
# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
    $null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}
# add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}
# Clean Up
$reader.Close(); $reader.Dispose()
$bulkcopy.Close(); $bulkcopy.Dispose()
$datatable.Dispose()
Write-Host "Script complete. $i rows have been inserted into the database."
Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
pause
It takes much longer with the regex:
24 seconds per 50,000 rows (with handling of delimiters embedded in quotes)
2 seconds per 50,000 rows (without the handling)
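The gap can presumably be reproduced outside the import loop as well; a rough micro-benchmark sketch like the following (single made-up sample line and an arbitrary 50,000 iterations, not the real file) is how I would compare the two split calls in isolation:
$line = 'Column1,"Car,plane,boat",Column3'
$csvdelimiter = ","
$csvSplit = "($csvdelimiter)" + '(?=(?:[^"]|"[^"]*")*$)'
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
# Time 50,000 plain splits vs. 50,000 regex splits of the same line
(Measure-Command { foreach ($n in 1..50000) { $null = $line.Split($csvdelimiter) } }).TotalSeconds
(Measure-Command { foreach ($n in 1..50000) { $null = [regex]::Split($line, $csvSplit, $regexOptions) } }).TotalSeconds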
Am I doing something wrong? Is regex the correct way to do this? Can I improve the script's performance in any way, or is this performance loss something I have to accept?
Update: Added the full script.