Skip Header Row in a High Performance Powershell Regex Script Block

Question

I received some amazing help from Stack Overflow ... however ... it was so amazing I need a little more help to get to closer to the finish line. I'm parsing multiple enormous 4GB files 2X per month. I need be able to be able to skip the header, count the total lines, matched lines, and the not matched lines. I'm sure this is super-simple for a PowerShell superstar, but at my newbie PS level my skills are not yet strong. Perhaps a little help from you would save the week. :)

Data Sample:

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
 10000000001MINNIE                 MOUSE              COLUMN VALUE LONGSTARTS 
 10000000002MICKLE ROONEY          MOUSE              COLUMN VALUE LONGSTARTS

Code Block (based on this answer):

#$match_regex matches each fixed length field by length; the () specifies that each matched field be stored in a capture group:
[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'

Measure-Command {
        & { 
          switch -File $infile -Regex  {
            $match_regex {
               # Join what all the capture groups matched with a tab char.
               $Matches[1..($Matches.Count-1)].Trim() -join "`t"
            }
          }
        } | Out-File $outFile
}

Yes, I'm sure I need to initialize a few line counters ... ttl_lines, ttl_matched, ttl_not_matches and then have appropriate ++ and else blocks; however, I'm still syntactically challenged and I'm hoping someone can show where all that goodness would fit in this super-slick block of coding modified from one of your owns' (stackoverflow's) suggestion. Then, of course, when time permits I will study every possible way to optimize! (Calls to C++, Net we'll see ...) So far I'm super pleased with this version ... just stumbling a little with the housekeeping. — Mark, Nov 13 '19 at 21:45
x15 - The big regex slices and dices a fixed column length text file, then trims the captures groups and joins with tabs. I figured the person with this answer would almost immediately recognize the coding pattern and know where the stuff gets inserted, but sorry if I was too vague. — Mark, Nov 13 '19 at 22:07
@mklement0 An additional requirement, besides the counters and the header processing, I want to output non-matched rows to an error file and if possible write to a log to show progress in the script. This piece of code (above) is a small (but an important) part of a larger script. The larger script checks if this months files have already been loaded, gets a count of the last two months' loads from SQL Server, downloads available load files, unzips them, tries to do this parsing, and uploads the parsed files to the database Server. The rest works, I don't want y'all to think I'm just lazy. — Mark, Nov 14 '19 at 15:07
For writing an error line, I wondering if "switch -File $infile -Regex" method stores the line under consideration somewhere regardless of whether there's a match. I'll keep looking and testing, but tips are welcome. — Mark, Nov 14 '19 at 15:48
@Mark: `$_` contains the input line at hand in script blocks inside of `switch` statements. — mklement0, Nov 14 '19 at 16:07
@mklement0 OK didn't know that and didn't see it in any documentation. Thank you! I am trying the scoping comment below next. This PowerShell thing is amazing, yet a _little_ tricky. — Mark, Nov 14 '19 at 16:36
@Mark: Yes, on both counts. The documentation is often lacking, unfortunately, though things are improving. Let's try to help: https://github.com/MicrosoftDocs/PowerShell-Docs/issues/5096 — mklement0, Nov 14 '19 at 16:48
@mklement0 very nice contribution to the PowerShell community, documentation matters :) — Mark, Nov 14 '19 at 17:23
@x15 I commented my sample code to indicate the purpose of the regex. Just waiting for acceptance on the edit. — Mark, Nov 14 '19 at 17:35

score 2 · Accepted Answer · edited Nov 14 '19 at 23:03

2

You only need to keep track of two counts - matched, and unmatched lines - and then a Boolean to indicate whether you've skipped the first line

$first = $false
$matched = 0
$unmatched = 0
. {
    switch -File $infile -Regex  {
        $match_regex {
            if($first){
                # Join what all the capture groups matched with a tab char.
                $Matches[1..($Matches.Count-1)].Trim() -join "`t"
                $matched++
            }
            $first = $true
        }
        default{
            $unmatched++
            # you can remove this, if the pattern always matches the header
            $first = $true
        }
    }
} | Out-File $outFile

$total = $matched + $unmatched

edited Nov 14 '19 at 23:03

Mark

193
11

answered Nov 13 '19 at 22:00

Mathias R. Jessen

157,619
12
148
206

I pulled the variables out of the dot script block entirely to just before where my regex variable is being set and I'm testing the big file now. If it works, your answer with @mklement0 ointment is the solution. I will next work on seeing if outputting the unmatched rows to a separate file is super simple or another trip to infinity and beyond. – Mark Nov 14 '19 at 16:53
OK! Everything works now, even the output of unmatched rows to an error file. Yeah! The 1.8 GB file +100 columns parses in about 7.8 minutes. I'm very happy with this! – Mark Nov 14 '19 at 17:11
One thing, can you explain why $first starts with $false? Is this for some type of optimization? Since we read the file from top to bottom wouldn't $first start as $true. I'm all about clarity, so just wondering about the thought there. Also, default will match any error, so wouldn't we just test for true and then set for false if it is true (meaning the header happens to not match anything)? Anyway, not sure how things work on Stackoverflow so just curious. – Mark Nov 14 '19 at 17:21
@Mark Well, you haven't completed processing the first line before you enter the switch? :) – Mathias R. Jessen Nov 14 '19 at 17:24
OK, I see what you did there. $First means first line has been passed. I was thinking the other way around. Thanks for the clarification. – Mark Nov 14 '19 at 17:26
@mklement0 I'm trying to accept the above solution but I'm getting "An error has occurred - please retry your request." Do you know how I can get past this message and accept the solution? – Mark Nov 15 '19 at 17:58
@Mark: That is bizarre - that should only be a transient problem. Are there any extensions / userscripts installed in your browser that cause that? Try from a different browser app and see if that helps. – mklement0 Nov 15 '19 at 18:01
@mklement0 I'm looking at this from work I saw ... meta.stackoverflow.com/questions/319942/… and they described it as a network filtering problem ... surely I should be able to click a button regardless of network filtering? – – Mark Nov 15 '19 at 18:16
@Mark: I'm out of my depth here, but I hope the information at the page you cited, https://meta.stackoverflow.com/q/319942/45375, helps. – mklement0 Nov 15 '19 at 18:18

score 0 · Answer 2 · answered Dec 02 '19 at 21:30

Using System.IO.StreamReader reduced the processing time to about 20% of what it had been. This was absolutely needed for my requirement.

I added logic and counters without sacrificing much on performance. The field counter and row by row comparison is particularly helpful in finding bad records.

This is a copy/paste of actual code but I shortened some things, made some things slightly pseudo code, so you may have to play with it to get things working just so for yourself.

Function Get-Regx-Data-Format() {
    Param ([String] $filename)

    if ($filename -eq 'FILE NAME') {
        [regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
    }
    return $match_regex
}

Foreach ($file in $cutoff_files) {

  $starttime_for_file = (Get-Date)
  $source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
  $source_path = $source_dir + $source_file

  $parse_file = $file + '_' + $proc_yyyymm + '_load' +$parse_target_suffix
  $parse_file_path = $parse_target_dir + $parse_file

  $error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
  $error_file_path = $error_target_dir + $error_file

  [regex]$match_data_regex = Get-Regx-Data-Format $file

  Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
  Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue

  [long]$matched_cnt = 0
  [long]$unmatched_cnt = 0
  [long]$loop_counter = 0
  [boolean]$has_header_row=$true
  [int]$field_cnt=0
  [int]$previous_field_cnt=0
  [int]$array_length=0

  $parse_minutes = Measure-Command {
    try {
        $stream_log = [System.IO.StreamReader]::new($source_path)
        $stream_in = [System.IO.StreamReader]::new($source_path)
        $stream_out = [System.IO.StreamWriter]::new($parse_file_path)
        $stream_err = [System.IO.StreamWriter]::new($error_file_path)

        while ($line = $stream_in.ReadLine()) {

          if ($line -match $match_data_regex) {

              #if matched and it's the header, parse and write to the beg of output file
              if (($loop_counter -eq 0) -and $has_header_row) {
                  $stream_out.WriteLine(($Matches[1..($array_length)].Trim() -join "`t"))

              } else {
                  $previous_field_cnt = $field_cnt

                  #add year month to line start, trim and join every captured field w/tabs
                  $stream_out.WriteLine("$proc_yyyymm`t" + `
                         ($Matches[1..($array_length)].Trim() -join "`t"))

                  $matched_cnt++
                  $field_cnt=$Matches.Count

                  if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
                    write-host "`nError on line $($loop_counter + 1). `
                                The field count does not match the previous correctly `
                                formatted (non-error) row."
                  }

              }
          } else {
              if (($loop_counter -eq 0) -and $has_header_row) {
                #if the header, write to the beginning of the output file
                  $stream_out.WriteLine($line)
              } else {
                $stream_err.WriteLine($line)
                $unmatched_cnt++
              }
          }
          $loop_counter++
       }
    } finally {
        $stream_in.Dispose()
        $stream_out.Dispose()
        $stream_err.Dispose()
        $stream_log.Dispose()
    }
  } | Select-Object -Property TotalMinutes

  write-host "`n$file_list_idx. File $file parsing results....`nMatched Count = 
  $matched_cnt  UnMatched Count = $unmatched_cnt  Parse Minutes = $parse_minutes`n"

  $file_list_idx++

  $endtime_for_file = (Get-Date)
  write-host "`nEnded processing file at $endtime_for_file"

  $TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
  $Hrs_for_file = $TimeDiff_for_file.Hours
  $Mins_for_file = $TimeDiff_for_file.Minutes
  $Secs_for_file = $TimeDiff_for_file.Seconds 
  write-host "`nElapsed Time for file $file processing: 
  $Hrs_for_file`:$Mins_for_file`:$Secs_for_file"

}

$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds 
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"

Skip Header Row in a High Performance Powershell Regex Script Block

2 Answers2