
I am trying to load a 160 GB CSV file into SQL Server using a PowerShell script I got from GitHub, and I get this error:

    Exception calling "Add" with "1" argument(s): "Input array is longer than the number of columns in this table."
    At C:\b.ps1:54 char:26
    + [void]$datatable.Rows.Add <<<< ($line.Split($delimiter))
        + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
        + FullyQualifiedErrorId : DotNetMethodException

So I checked the same code with a small 3-line CSV: all of the columns match, the first row is a header, and there are no extra delimiters, so I am not sure why I am getting this error.

The code is below:

<# 8-faster-runspaces.ps1 #>
# Set CSV attributes
$csv = "M:\d\s.txt"
$delimiter = "`t"

# Set connstring
$connstring = "Data Source=.;Integrated Security=true;Initial Catalog=PresentationOptimized;PACKET SIZE=32767;"

# Set batchsize to 2000
$batchsize = 2000

# Create the datatable
$datatable = New-Object System.Data.DataTable

# Add generic columns
$columns = (Get-Content $csv -First 1).Split($delimiter) 
foreach ($column in $columns) {
    [void]$datatable.Columns.Add()
}

# Setup runspace pool and the scriptblock that runs inside each runspace
$pool = [RunspaceFactory]::CreateRunspacePool(1,5)
$pool.ApartmentState = "MTA"
$pool.Open()
$runspaces = @()

# Setup scriptblock. This is the workhorse. Think of it as a function.
$scriptblock = {
    Param (
        [string]$connstring,
        [object]$dtbatch,
        [int]$batchsize
    )

    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring, "TableLock")
    $bulkcopy.DestinationTableName = "abc"
    $bulkcopy.BatchSize = $batchsize
    $bulkcopy.WriteToServer($dtbatch)
    $bulkcopy.Close()
    $dtbatch.Clear()
    $bulkcopy.Dispose()
    $dtbatch.Dispose()
}

# Start timer
$time = [System.Diagnostics.Stopwatch]::StartNew()

# Open the text file from disk and process.
$reader = New-Object System.IO.StreamReader($csv)

Write-Output "Starting insert.."
while (($line = $reader.ReadLine()) -ne $null)
{
    [void]$datatable.Rows.Add($line.Split($delimiter))

    if ($datatable.Rows.Count % $batchsize -eq 0)
    {
        $runspace = [PowerShell]::Create()
        [void]$runspace.AddScript($scriptblock)
        [void]$runspace.AddArgument($connstring)
        [void]$runspace.AddArgument($datatable) # <-- Send datatable
        [void]$runspace.AddArgument($batchsize)
        $runspace.RunspacePool = $pool
        $runspaces += [PSCustomObject]@{ Pipe = $runspace; Status = $runspace.BeginInvoke() }

        # Overwrite object with a shell of itself
        $datatable = $datatable.Clone() # <-- Create new datatable object
    }
}

# Close the file
$reader.Close()

# Wait for runspaces to complete
while ($runspaces.Status.IsCompleted -notcontains $true) {}

# End timer
$secs = $time.Elapsed.TotalSeconds

# Cleanup runspaces 
foreach ($runspace in $runspaces) {
    [void]$runspace.Pipe.EndInvoke($runspace.Status) # EndInvoke retrieves the results of the asynchronous call
    $runspace.Pipe.Dispose()
}

# Cleanup runspace pool
$pool.Close() 
$pool.Dispose()

# Cleanup SQL Connections
[System.Data.SqlClient.SqlConnection]::ClearAllPools()

# Done! Format output then display
$totalrows = 1000000
$rs = "{0:N0}" -f [int]($totalrows / $secs)
$rm = "{0:N0}" -f [int]($totalrows / $secs * 60)
$mill = "{0:N0}" -f $totalrows

Write-Output "$mill rows imported in $([math]::round($secs,2)) seconds ($rs rows/sec and $rm rows/min)"
  • Usually this error indicates that some line has an unexpected embedded delimiter... a tab, in this case. It is just a bit of dirty input data. You can try reading in the rows, replacing the tabs with empty strings, then seeing which row(s) have more tabs than you have columns by comparing the original and shrunk size (see the sketch after these comments). If you have four columns, you would expect a row to shrink by three characters. – Laughing Vergil Apr 07 '17 at 16:38
  • @Laughing Vergil Thank you for your response. I tried the same with a comma-separated file and I get the same error. – Illuminati Apr 07 '17 at 16:45
  • You also may have a problem with line terminators on some data - having a UNIX style newline in the middle of a Windows style text file ends up with two rows in one data retrieval action. Embedded newlines or CR/LF pairs can also mess up your processing. – Laughing Vergil Apr 07 '17 at 16:56
  • so is there any way we can solve this? – Illuminati Apr 07 '17 at 17:14
  • Here is one answer on [newlines](http://stackoverflow.com/questions/724083/unix-newlines-to-windows-newlines-on-windows) – BenH Apr 07 '17 at 18:38
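A minimal sketch of that check, assuming the tab-delimited file and header layout from the question (the path and delimiter are the question's own; everything else is illustrative). It counts delimiter-separated fields per line instead of comparing string lengths, which amounts to the same test:

    # Report every line whose field count differs from the header's.
    $csv = "M:\d\s.txt"
    $delimiter = "`t"

    $reader = New-Object System.IO.StreamReader($csv)
    $header = $reader.ReadLine()
    $expected = $header.Split($delimiter).Count
    $lineNumber = 1

    while (($line = $reader.ReadLine()) -ne $null) {
        $lineNumber++
        $found = $line.Split($delimiter).Count
        if ($found -ne $expected) {
            Write-Output "Line ${lineNumber}: expected $expected fields, found $found"
        }
    }
    $reader.Close()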

1 Answer


Working with a 160 GB input file is going to be a pain. You can't really load it into any kind of editor - or at least you can't really analyze that much data without some serious automation.

As per the comments, it seems that the data has some quality issues. In order to find the offending data, you could try a binary search. This approach shrinks the search space quickly. Like so (a sketch of the splitting step follows the list):

1) Split the file into two roughly equal chunks.
2) Try to load the first chunk.
3) If it loads, move on to the second chunk. If not, see 6).
4) Try to load the second chunk.
5) If it loads too, both chunks are valid and you have some other data quality issue; start looking into other causes. If not, see 6).
6) Whenever a chunk fails to load, start again from 1) using that failed chunk as the input file.
7) Repeat until you have narrowed it down to the offending row(s).
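
A minimal sketch of step 1), assuming a single header row that should be repeated in both chunks so each can be loaded on its own. The source path is the one from the question; the chunk file names are placeholders:

    # Split the file into two roughly equal halves by line count.
    $source = "M:\d\s.txt"
    $half1  = "M:\d\s_part1.txt"
    $half2  = "M:\d\s_part2.txt"

    # First pass: count data lines without loading the file into memory.
    $total = 0
    $reader = New-Object System.IO.StreamReader($source)
    [void]$reader.ReadLine() # skip header
    while ($reader.ReadLine() -ne $null) { $total++ }
    $reader.Close()
    $splitAt = [int]($total / 2)

    # Second pass: write the header plus half of the data to each chunk.
    $reader  = New-Object System.IO.StreamReader($source)
    $writer1 = New-Object System.IO.StreamWriter($half1)
    $writer2 = New-Object System.IO.StreamWriter($half2)

    $header = $reader.ReadLine()
    $writer1.WriteLine($header)
    $writer2.WriteLine($header)

    $count = 0
    while (($line = $reader.ReadLine()) -ne $null) {
        $count++
        if ($count -le $splitAt) { $writer1.WriteLine($line) } else { $writer2.WriteLine($line) }
    }

    $writer1.Close()
    $writer2.Close()
    $reader.Close()

Whichever chunk fails to load becomes the input for the next round, per step 6).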

Another method would be to use an ETL tool such as SSIS. Configure the package to redirect invalid rows into an error log so you can see which data is not working properly.

vonPryz