
Monthly I receive several very large (~4 GB) fixed-column-width text files that need to be imported into MS SQL Server. To import a file, it must be converted into a text file with tab-delimited column values, with spaces trimmed from each column value (some columns have no spaces). I'd like to use PowerShell to solve this, and I'd like the code to be very, very fast.

I've tried many iterations of code, but so far everything has been either too slow or not working. I've tried the Microsoft Text Parser (too slow) and regex matching. I'm working on a Windows 7 machine with PowerShell 5.1 installed.

 ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
 10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 

$infile = "C:\Testing\IN_AND_OUT_FILES\srctst.txt"
$outfile = "C:\Testing\IN_AND_OUT_FILES\outtst.txt"

$batch = 1

[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
[regex]$replace_regex = "`${1}`t`${2}`t`${3}`t`${4}`t`${5}`t`${6}`t`${7}`t`${8}`t`${9}`t`${10}`t`${11}`t`${12}`t`${13}`t`${14}`t`${15}`t`${16}`t`${17}`t`${18}`t`${19}`t`${20}`t`${21}`t`${22}`t`${23}`t`${24}`t`${25}`t`${26}`t`${27}`t`${28}`t`${29}`t`${30}`t`${31}`t`${32}`t`${33}"

Get-Content $infile -ReadCount $batch |
    foreach {
        $_ -replace $match_regex, $replace_regex | Out-File $outfile -Append
    }

Any help you can give is appreciated!

Mark
  • `$element = $_.trim()` will yield nothing since you are not using `foreach-object`. `$element = $element.trim()` will yield better results. – AdminOfThings Nov 08 '19 at 22:28
  • If you want to convert a list into a tab-delimited string, just use ```$list -join "`t"```. – AdminOfThings Nov 08 '19 at 22:52
  • [1] `-ReadCount` defaults to `1`, so you are not gaining anything that way. [*grin*] [2] `\t` is not valid in PoSh for a tab - that uses a backtick instead of a slash. it may work in the dotnet `[regex]` calls, tho. [3] have you looked at `StreamReader` yet? that is the usual recommendation for fast text file reads/writes. [4] you can use ```$Matches[1..($Matches.Count -1)] -join "`t"``` to build your tab-delimited string from the capture groups. – Lee_Dailey Nov 09 '19 at 01:13
  • Do not use `$Input` as a normal variable; it is reserved as an [automatic variable](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_automatic_variables?view=powershell-6#input) in PowerShell. – iRon Nov 09 '19 at 08:41
  • Good points, @Lee_Dailey, but for large files it's always worth considering `switch -File` first, before resorting to more complex solutions based on .NET types. – mklement0 Nov 09 '19 at 13:30
  • The sentiment is appreciated, @Mark, but note that for commenters here to be notified of your follow-up comment, you'd have to @-mention them, and the catch is that you can only @-mention _one_ user at a time - see https://meta.stackexchange.com/a/43020/248777 – mklement0 Nov 13 '19 at 17:12

2 Answers


The `switch` statement with the `-File` option is the fastest way to process large files in PowerShell[1]:

& { 
  switch -File $infile -Regex  {
    $match_regex {
       # Join what all the capture groups matched, trimmed, with a tab char.
       $Matches[1..($Matches.Count-1)].Trim() -join "`t"
    }
  }
} | Out-File $outFile # or: Set-Content $outFile (beware encoding issues)

With text output, `Out-File` and `Set-Content` can be used interchangeably, but note that in Windows PowerShell they use different character encodings by default (UTF-16LE vs. ANSI); use `-Encoding` as needed. PowerShell Core uses BOM-less UTF-8 consistently.
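For instance, here is a minimal variation of the pipeline above that pins the output encoding (same $infile, $outFile, and $match_regex as before; whether you want a BOM depends on what your SQL Server import process expects):

& {
  switch -File $infile -Regex {
    $match_regex { $Matches[1..($Matches.Count-1)].Trim() -join "`t" }
  }
} | Out-File $outFile -Encoding utf8  # Windows PowerShell: UTF-8 *with* BOM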

Note:

  • To skip the header row or capture it separately, either provide a separate regex for it or, if the header also matches the data-row regex, initialize a line-index variable before the switch statement (e.g., $i = 0) and check and increment that variable in the processing script block (e.g., if ($i++ -eq 0) { ... }); a sketch follows this list.

  • .Trim() is implicitly called on each string in the array returned by $Matches[1..($Matches.Count-1)]; this feature is called member-access enumeration.

  • The reason the switch statement is enclosed in & { ... } (a script block ({ ... }) invoked with the call operator (&)) is that compound statements such as switch, while, and foreach (...) aren't directly supported as pipeline input - see this GitHub issue.
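Here's a minimal sketch of the header-skipping technique from the first bullet above (a hypothetical two-column layout; the regex and the file variables are placeholders):

& {
  $i = 0
  switch -File $infile -Regex {
    '^(.{10})(.{50})$' {
      # The header line matches the data-row regex too, so skip line 1.
      if ($i++ -eq 0) { continue }  # continue = advance to the next input line
      # .Trim() runs on each array element via member-access enumeration.
      $Matches[1..($Matches.Count-1)].Trim() -join "`t"
    }
  }
} | Out-File $outFile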


As for what you tried:

As iRon points out, you shouldn't use $Input as a user variable - it is an automatic variable managed by PowerShell, and, in fact, whatever you assign to it is quietly discarded.

As AdminOfThings points out:

  • $element = $_.trim() doesn't work, because you're inside a foreach loop, not in the pipeline with a ForEach-Object cmdlet (even though the latter is also aliased to foreach); only with ForEach-Object would $_ be set to the current input object.

  • There is no need for a custom function just for joining the elements of an array with a separator; the -join operator does that directly, as shown above.
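For instance, with a hypothetical array of untrimmed field values:

$fields = '10000000001', 'MINNIE                  ', ' MOUSE'
$fields.Trim() -join "`t"  # -> 10000000001<tab>MINNIE<tab>MOUSE

Note how .Trim() on the array itself trims each element (member-access enumeration again).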

Lee_Dailey shows how to use -join directly with the $Matches array, as used above.

A few asides:

Join-Str($matches)

You should use Join-Str $matches instead:

  • In PowerShell, functions are invoked like shell commands - foo arg1 arg2 - not like C# methods - foo(arg1, arg2); see Get-Help about_Parsing.
  • If you use , to separate arguments, you'll construct an array that a function sees as a single argument.
  • To prevent accidental use of method syntax, use Set-StrictMode -Version 2 or higher, but note its other effects.
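A quick demonstration with a throwaway function (hypothetical, purely to illustrate the binding difference):

function Show-Args { param($a, $b) "a=[$a] b=[$b]" }

Show-Args 1 2    # -> a=[1] b=[2]  : two arguments, bound individually
Show-Args 1, 2   # -> a=[1 2] b=[] : 1,2 is ONE array argument
Show-Args(1, 2)  # -> a=[1 2] b=[] : same thing; the parentheses don't help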

| Out-Null

An almost always faster method of output suppression is to use $null = ... instead.
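A typical case is discarding the index that System.Collections.ArrayList.Add() returns; a minimal sketch:

$list = [System.Collections.ArrayList]::new()
$list.Add(1) | Out-Null  # works, but incurs pipeline overhead
$null = $list.Add(2)     # near-instant: the output is simply discarded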


[1] Mark (the OP) reports a dramatic speed-up compared to the Get-Content + ForEach-Object approach in the question (the switch solution takes 7.7 mins. for a 4GB file).
While a switch solution is likely fast enough in most scenarios, this answer shows a solution that may be faster for high iteration counts; this answer contrasts it with a switch solution and shows benchmarks with varying iteration counts.
Beyond that, a compiled solution written in, say, C#, is the only way to further improve performance.

mklement0

Here's the high-level structure of my working code. Note that using System.IO.StreamReader was essential to getting the processing time down to an acceptable level. Thanks for all the help that led me here.

# Returns the fixed-width parsing regex for the given file name
# ($null if the file name isn't recognized).
Function Get-Regx-Data-Format() {
    Param ([String] $filename)

    if ($filename -eq 'FILE NAME') {
        [regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
    }
    return $match_regex
}

Foreach ($file in $cutoff_files) {

  $starttime_for_file = (Get-Date)
  $source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
  $source_path = $source_dir + $source_file

  $parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
  $parse_file_path = $parse_target_dir + $parse_file

  $error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
  $error_file_path = $error_target_dir + $error_file

  [regex]$match_data_regex = Get-Regx-Data-Format $file

  Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
  Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue

  [long]$matched_cnt = 0
  [long]$unmatched_cnt = 0
  [long]$loop_counter = 0
  [boolean]$has_header_row=$true
  [int]$field_cnt=0
  [int]$previous_field_cnt=0
  [int]$array_length=0

  $parse_minutes = Measure-Command {
    try {
        $stream_log = [System.IO.StreamReader]::new($source_path)
        $stream_in = [System.IO.StreamReader]::new($source_path)
        $stream_out = [System.IO.StreamWriter]::new($parse_file_path)
        $stream_err = [System.IO.StreamWriter]::new($error_file_path)

        while ($null -ne ($line = $stream_in.ReadLine())) {  # $null test: don't stop early at an empty line

          if ($line -match $match_data_regex) {

              $array_length = $Matches.Count - 1   # number of capture groups

              #if matched and it's the header, parse and write to the beg of output file
              if (($loop_counter -eq 0) -and $has_header_row) {
                  $stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))

              } else {
                  $previous_field_cnt = $field_cnt

                  #add year month to line start, trim and join every captured field w/tabs
                  $stream_out.WriteLine("$proc_yyyymm`t" + `
                         ($Matches[1..($array_length)].Trim() -join "`t"))

                  $matched_cnt++
                  $field_cnt=$Matches.Count

                  if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
                    write-host "`nError on line $($loop_counter + 1). The field count does not match the previous correctly formatted (non-error) row."
                  }

              }
          } else {
              if (($loop_counter -eq 0) -and $has_header_row) {
                #if the header, write to the beginning of the output file
                  $stream_out.WriteLine($line)
              } else {
                $stream_err.WriteLine($line)
                $unmatched_cnt++
              }
          }
          $loop_counter++
       }
    } finally {
        $stream_in.Dispose()
        $stream_out.Dispose()
        $stream_err.Dispose()
        $stream_log.Dispose()
    }
  } | Select-Object -ExpandProperty TotalMinutes  # a plain number rather than an object

  write-host "`n$file_list_idx. File $file parsing results....`nMatched Count = $matched_cnt  UnMatched Count = $unmatched_cnt  Parse Minutes = $parse_minutes`n"

  $file_list_idx++

  $endtime_for_file = (Get-Date)
  write-host "`nEnded processing file at $endtime_for_file"

  $TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
  $Hrs_for_file = $TimeDiff_for_file.Hours
  $Mins_for_file = $TimeDiff_for_file.Minutes
  $Secs_for_file = $TimeDiff_for_file.Seconds 
  write-host "`nElapsed Time for file $file processing: $Hrs_for_file`:$Mins_for_file`:$Secs_for_file"

}

$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds 
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"
Mark
  • @mklement0 I thought you might like to scan my solution. System.IO.StreamReader decreased processing time by about 75%. – Mark Dec 02 '19 at 21:40