
I have a working script whose objective is to parse data files for malformed rows before importing into Oracle. Processing a 450MB CSV file with more than 1 million rows and 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).

Oddly, a 350MB file with a similar number of rows but 40 columns only takes 30 minutes.

My issue is that the files will grow over time, and 2.5 hours tying up a CPU core ain't good. Can anyone recommend a code optimisation? A similarly titled post recommended local paths, which I'm already using.

$file = "\Your.csv"

$path = "C:\Folder"

$csv  = Get-Content "$path$file"

# Count the number of header columns
$count = ($csv[0] -split ',').count

# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"

# 2 validation steps: (1) the row's column count must be >= the header column count; (2) after splitting off the first column, the remainder must be at least 40 characters long.
$csv | Select -Skip 1 | % {
  if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
     $stream1.WriteLine($_)
  } else {
     $stream2.WriteLine($_) 
  }
}
$stream1.close()
$stream2.close()

Sample Data File:

C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
felixmc
  • How about a few sample lines of your input file? And a description of what you are trying to do? – Mark Setchell Jan 27 '17 at 10:34
  • Get-Content is superslow. Use IO.StreamReader. Also use a different hard drive for the output or specify a large write buffer in the StreamWriter constructor. – wOxxOm Jan 27 '17 at 10:44
  • Now that's the sort of suggestion I'd hoped for! Good man, cheers. – felixmc Jan 27 '17 at 10:46
  • Can you place some Measure-Command { } blocks around each section of your code to see where the delays are? i.e.: is it the loading, the file writing, etc. – Simon Catlin Jan 27 '17 at 10:47
  • I'd thought about trying that Simon but I figured it had to be the If statement. Reading the file takes a while but watching Resource monitor I could see that the disc writes kicked off after a short while. I haven't considered that perhaps it's writing while still reading. I'll stick a Measure in...can't hurt! Cheers. – felixmc Jan 27 '17 at 10:50
  • @wOxxOm not really, post PS 3 get-content is almost as effective as streamreader. – 4c74356b41 Jan 27 '17 at 11:00
  • @4c74356b41, it isn't in this case because 1) it creates a huge array in the variable 2) the OP uses the very slow pipeline enumeration. My point was about StreamReader used in a standard fashion (properly) via `while` loop that reads one line at a time. – wOxxOm Jan 27 '17 at 11:08

3 Answers

  • Get-Content is extremely slow in its default mode, which produces an array, when the file contains millions of lines, on all PowerShell versions including 5.1. What's worse, you're assigning it to a variable, so nothing else happens until the entire file is read and split into lines. On an Intel i7 3770K CPU at 3.9GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.

    Solution: Use IO.StreamReader to read a line and process it immediately.
    In PowerShell 2, StreamReader is less optimized than in PS3+, but still faster than Get-Content.


  • Pipelining via | is at least several times slower than direct enumeration via flow-control statements such as the while or foreach statement (not the ForEach-Object cmdlet); see the short comparison sketch after this list.
    Solution: use the statements.

  • Splitting each line into an array of strings is slower than manipulating the single string directly.
    Solution: use the IndexOf and Replace methods (not the operators) to count character occurrences.

  • PowerShell always creates an internal pipeline when loops are used.
    Solution: use the Invoke-Command { } trick for 2-3x speedup in this case!
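
A quick way to see the pipeline overhead on your own machine is a Measure-Command comparison along these lines (a minimal sketch using an arbitrary in-memory test array; absolute timings will vary by PowerShell version and hardware):

# Minimal sketch: ~1 million identical rows held in memory (arbitrary test data)
$lines = @('ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10') * 1000000

# Pipeline enumeration via the ForEach-Object cmdlet
(Measure-Command {
    $lines | ForEach-Object { $null = $_.Length }
}).TotalSeconds

# Direct enumeration via the foreach statement
(Measure-Command {
    foreach ($s in $lines) { $null = $s.Length }
}).TotalSeconds

# The same loop wrapped in Invoke-Command { } to suppress the internal pipeline
(Measure-Command {
    Invoke-Command { foreach ($s in $lines) { $null = $s.Length } }
}).TotalSeconds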

Below is PS2-compatible code.
It's faster in PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).

$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count

$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)

Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0

Invoke-Command { # the speed-up trick: disables internal pipeline
while (!$reader.EndOfStream) {
    $s = $reader.ReadLine()
    $slen = $s.length
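    # Pass when everything after the first comma is at least 40 chars
    # and the comma count + 1 equals the header's column count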
    if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
        $writer1.WriteLine($s)
    } else {
        $writer2.WriteLine($s)
    }
    if (++$currLine % 10000 -eq 0) {
        $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
        Write-Progress 'Filtering...' -status "Line: $currLine" `
            -PercentComplete ($pctDone * 100) `
            -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
    }
}
} #Invoke-Command end

Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"

$reader.close()
$writer1.close()
$writer2.close()

Another approach is to use regex in two passes (it's slower than the above code, though).
PowerShell 3 or newer is required due to array element property shorthand syntax:

$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count

$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
                                    ($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
wOxxOm
  • Excellent answer @wOxxOm ... clear and concise with explanation why approach used was slow and what to use instead. Runtime went from a couple of hours to a few minutes. – felixmc Feb 02 '17 at 12:06
  • @felixmc, another 2x speedup thanks to Invoke-Command! See the updated code. – wOxxOm Feb 09 '17 at 21:00
  • "In PowerShell2 StreamReader is less optimized than in PS3+ but still faster than Get-Content" -- how is this so? Isn't StreamReader coming from .NET and not PowerShell? I'm curious since I thought I was invoking native .NET classes from PowerShell and not some other weird construct that I actually might be doing??? – thepip3r Sep 27 '18 at 22:13

If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)

awk -F, '
   NR==1                    {f=NF; printf("Expecting: %d fields\n",f)}  # First record, get expected number of fields
   NF!=f                    {print > "Fail.txt"; next}                  # Fail for wrong field count
   length($0)-length($1)<40 {print > "Fail.txt"; next}                  # Fail for wrong length
                            {print > "Pass.txt"}                        # Pass
   ' MillionRecord.csv

You can get gawk for Windows from here.

Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:

Save this in a file called commands.awk:

NR==1                    {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f                    {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
                         {print > "Pass.txt"}

Then run with:

awk -F, -f commands.awk Your.csv
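
If you end up calling it from PowerShell rather than cmd, the invocation could look like this (a sketch that assumes gawk.exe is on the PATH and uses the question's folder layout; Pass.txt and Fail.txt are written to the current directory):

# Sketch: invoke gawk from PowerShell (assumes gawk.exe is on PATH)
$path = 'C:\Folder'
& gawk -F ',' -f (Join-Path $path 'commands.awk') (Join-Path $path 'Your.csv')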

The remainder of this answer relates to the "Beat hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over 3.5GB spread across 1,543 files amounting to around 104 million records:

#!/bin/bash
doit(){
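   # Tally White (1-0) and Black (0-1) wins from the [Result ...] tags in the given PGN files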
   awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' $@
}

export -f doit
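# Split the PGN files across 12 parallel jobs, 4 files per awk invocation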
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
Mark Setchell
  • A great answer but not an applicable solution unfortunately. Will keep in mind for future ref tho. Cheers. – felixmc Jan 27 '17 at 11:17
  • why use awk when posh can do it perfectly fine and faster? like look at this thread for example https://foxdeploy.com/2016/03/01/powershell-challenge-beating-hadoop-with-posh/ – 4c74356b41 Jan 27 '17 at 11:41
  • @4c74356b41 The OP asked for an optimisation and I think a 10,000x speed-up is a pretty good optimisation. Sometimes the best optimisations involve different algorithms or different tools. If, as OP has indicated, he wishes to stick to Powershell, then he is welcome to ignore my suggestion. – Mark Setchell Jan 27 '17 at 11:47
  • I'm def going to go off and see if I can implement gawk. Am leaving open to see if solutions specific to Powershell are provided. I completely agree with your 'different tools' assertion but in this case I have to consider support-ability. – felixmc Jan 27 '17 at 11:54
  • @4c74356b41 Interesting challenge - I had a little play and can comfortably beat their best time of 8.7 seconds using `awk` in 6.002 seconds. I don't know how comparable my system is with theirs obviously, but mine is just a desktop iMac, and nothing special. – Mark Setchell Jan 27 '17 at 15:14

Try experimenting with different looping strategies; for example, switching to a for loop cuts the processing time by more than 50%:

[String]                 $Local:file           = 'Your.csv';
[String]                 $Local:path           = 'C:\temp';
[System.Array]           $Local:csv            = $null;
[System.IO.StreamWriter] $Local:objPassStream  = $null;
[System.IO.StreamWriter] $Local:objFailStream  = $null; 
[Int32]                  $Local:intHeaderCount = 0;
[Int32]                  $Local:intRow         = 0;
[String]                 $Local:strRow         = '';
[TimeSpan]               $Local:objMeasure     = 0;

try {
    # Load.
    $objMeasure = Measure-Command {
        $csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
        $intHeaderCount = ($csv[0] -split ',').count;
        } #measure-command
    'Load took {0}ms' -f $objMeasure.TotalMilliseconds;

    # Create stream writers.
    try {
        $objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
        $objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );

        # Process CSV (v1).
        $objMeasure = Measure-Command {
            $csv | Select-Object -Skip 1 | Foreach-Object { 
                if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $_ );   
                } else {
                    $objFailStream.WriteLine( $_ );
                } #else-if
                } #foreach-object
            } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        # Process CSV (v2).
        $objMeasure = Measure-Command {
            for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
                if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $csv[$intRow] );   
                } else {
                    $objFailStream.WriteLine( $csv[$intRow] );
                } #else-if
                } #for
            } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        } #try
    catch [System.Exception] {
        'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
         } #catch
    finally {
        $objFailStream.close();
        $objPassStream.close();    
        } #finally

   } #try
catch [System.Exception] {
    'ERROR : Failed to load CSV.';
    } #catch

exit 0;
Simon Catlin
  • A REALLY good answer @Simon Catlin - great template that I can see me coming back to and using for testing. 50% reduction time on the 2nd block. Superb! – felixmc Feb 02 '17 at 12:09