
I have a PowerShell program that reads a file, looks up each record's account in another table, and then writes out a new, updated file with the extra information. It actually passes the file twice: once to accumulate totals, and then again to build the output file. (Whether I use a table or a second file pass, I cannot sort the data, because it has to stay in its original order so that transactions stay in balance.)

I've tried loading the file into an array and passing over that twice, and I've tried reading the file twice without building the array. Reading the file twice is orders of magnitude faster, and that's not counting how long it takes to load the array; this is just my loop time.

When I run with a file that is about 12 MB with 84,000 rows, it runs fairly quickly: the whole thing takes about 5 minutes (reading the file twice, not using a table).

When I process a larger file (1.2 GB with 7 million records), it takes way longer. The table version estimates it will take over 2 days; reading the file twice takes about 8 hours.

I know 7 million records is a lot, but 8 hours is a REALLY long time on a Windows server with 256 GB of RAM and way more processors than PowerShell can even use. (Even the 5 minutes to process 84,000 records seems like an eternity.)

When I was building the table, PowerShell would consume over 30 GB of RAM. It seems that when PowerShell has a large memory footprint, the performance drops drastically.

I've read a lot of things about PowerShell performance. I am not building strings with +=, and I'm not piping a lot of things into other things. This really is very basic: read a CSV into an array, then read a file, and for each record in the file scan the array; when an item from the file is found in the array, do some simple calculations and move on. (The array isn't sorted because I am actually searching on two different fields in it, and there are duplicate values in both, so I have to scan the whole thing each time; but the array has fewer than 400 records in it.)

So when I do the math (7 million records, each scanning a 400-record array), it sounds like a lot, and it is. But still, 8 hours on a high-end Windows server? Is this really how slow PowerShell is?

Thank you to any and all who have read this far and have suggestions or advice. -Todd

EDIT: Here is some of the code. The table that is read into memory is loaded with this command:

$ICList = Import-Csv -Path $Path2 -encoding Default

It has 31 sets of buckets that are used to accumulate totals for each day of the month (two totals and one percent per day).

Here are the headers and a sample record:

ICKey,ICAC,OffsetAC,OffsetBank,Type,Last,Total,offtotal,pct,Total01,offtotal01,pct01,Total02,offtotal02,pct02,Total03,offtotal03,pct03,Total04,offtotal04,pct04,Total05,offtotal05,pct05,Total06,offtotal06,pct06,Total07,offtotal07,pct07,Total08,offtotal08,pct08,Total09,offtotal09,pct09,Total10,offtotal10,pct10,Total11,offtotal11,pct11,Total12,offtotal12,pct12,Total13,offtotal13,pct13,Total14,offtotal14,pct14,Total15,offtotal15,pct15,Total16,offtotal16,pct16,Total17,offtotal17,pct17,Total18,offtotal18,pct18,Total19,offtotal19,pct19,Total20,offtotal20,pct20,Total21,offtotal21,pct21,Total22,offtotal22,pct22,Total23,offtotal23,pct23,Total24,offtotal24,pct24,Total25,offtotal25,pct25,Total26,offtotal26,pct26,Total27,offtotal27,pct27,Total28,offtotal28,pct28,Total29,offtotal29,pct29,Total30,offtotal30,pct30,Total31,offtotal31,pct31
123456XX0012,123456,987654,XX0012,12m,no,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The file that is read in the next loop looks like this:

001,Company_Reference_ID,XX0012,USD,ACTUALS,2020-12-31,Conversion,,,,00000003,,,000000005376026.81,,,,,,123456,00100217,,,

The loop that accumulates the totals looks like this:

$reader = [System.IO.File]::OpenText($path)
$line = $reader.ReadLine()    # read and discard the header line

try {
    while ($null -ne ($line = $reader.ReadLine())) {
        $item = $line.Split(",")
        $thisDate = $item[5].substring(8, 2)   #AccountingDate
        $ttot = 'Total' + $thisDate
        $toff = 'Offtotal' + $thisDate
        foreach ($item2 in $ICList) { 
            $itco = $item[2]                  #CompanyReferenceID
            $itkey = $item[19]                #ExternalCode_Account
            if ($item2.Type -eq 'm2m' -and $itco -ne 'LE0001') { $itco = 'LE0004' }
            if ($itkey -eq '109495') { $itkey = '109479' }
            if ($itkey -eq '172574') { $itkey = '172697' }
            if ($itkey -eq '172718') { $itkey = '172590' }
            $thisrecord = $itkey + $itco
            if ($thisrecord -eq $item2.ICKey) {
                $item2.$ttot = [Double]$item2.$ttot + [Double]$item[14] - [Double]$item[13]     # .Creditamount .Debitamount
            } 
            if ($itkey -eq $item2.offsetac) {
                $Item2.$toff = [Double]$Item2.$toff + [Double]$item[14] - [Double]$item[13]   #.Creditamount .Debitamount
            }
        }
        $count++
    }
}
catch {
    Write-Host $_.Exception.Message
}
finally {
    $reader.Close()
}

The above loop takes the bulk of the time.

I then have a little loop that spins through the $ICList table to calculate percentages:

foreach ($item2 in $ICList) { 
    foreach ($day in $days) {
        $ttot = 'Total' + '{0:d2}' -f $day
        $toff = 'Offtotal' + '{0:d2}' -f $day
        $tpct = 'Pct' + '{0:d2}' -f $day
        $item2.$tpct = ([double]$Item2.$ttot) / ([double]$Item2.$toff)
    }
}

And then I read the file again and write it out with extra records to split a single amount across multiple companies, adding a new column with the company at the end. This part actually runs faster than the first loop (even though it's doing a lot more processing).

$first = 1
$count = 1

$reader = [System.IO.File]::OpenText($path)
$outputFileName = [io.path]::ChangeExtension($path, "txt") 
try {
$outputFileName = [io.path]::GetFileNameWithoutExtension($path) + '-' + $date + ".fix.csv"
$outputFileName = [io.path]::Combine([io.path]::GetDirectoryName($path), $outputFileName)
$writer = [System.IO.StreamWriter] $outputFileName
$line = $reader.ReadLine()
$writer.WriteLine($line)
while ($null -ne ($line = $reader.ReadLine())) {
    $item = $line.Split(",")
        
    if ($first -eq 1) {
        $lastjourn = $item[0]        #.journalkey
        $first = 999
    }
    if ($lastjourn -ne $item[0]) {
        # .JournalKey 
        $count = 1
    }
    $offsetbank = $null
    $chgCR = $item[14]                #.CreditAmount
    $chgDB = $item[13]                #.DebitAmount
    foreach ($item2 in $ICList) { 
        $itkey = $item[19]            #.ExternalCode_Account
        if ($itkey -eq $item2.offsetac) {
            $newCR = $null
            $newDB = $null
            $offsetbank = $item2.offsetbank
            $tpct = 'Pct' + $item[5].substring(8, 2)  #accountingDate   
            if ($item2.$tpct -ne 1) {

                if ('-' + $item[14] + '-' -ne '--') {
                    #.CreditAmount 

                    $newCR = [math]::round([double]$item[14] * $item2.$tpct, 2)    #.CreditAmount
                       if ($item2.last -eq 'no') {
                            $chgCR = [math]::round([double]$chgCR - [double]$newCR, 2)
                        }
    
                }
            
                if ('-' + $item[13] + '-' -ne '--') {
                    #.DebitAmount
                    $thispct = $item2.$tpct * -1
                    $newDB = [math]::round([double]$item[13] * $thispct, 2)     #.DebitAmount
                        if ($item2.last -eq 'no') {
                            $chgDB = [math]::round([double]$chgDB - [double]$newDB, 2)
                        }
                }
            
                if ($item[19] -eq '202884' -or $item[19] -eq '103886') {
                }
                if (('-' + $NewDB + '-' -ne '--' -and $newdb -ne 0) -or ('-' + $newCR + '-' -ne '--' -and $newcr -ne 0)) {

                    if ($item2.last -eq 'yes') {

                        $item[14] = [double]$chgCR        #.CreditAmount
                        $item[13] = [double]$chgDB        #.DebitAmount
                        $offsetbank = $item2.offsetbank
                    }
                    else {
                        $tmpcr = [double]$newcr                   
                        $tmpdb = [double]$newdb                 
                        $xyzd = 0
                        $xyzc = 0
                    
                        if ($tmpcr -eq 0) {
                            $xyzd = [double]$tmpdb.ToString("0.00")
                            $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}"
                                , $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9]
                                , $count, $item[11], $item[12], [math]::abs($xyzd.tostring("0.00")), $null, $item[15], $item[16], $item[17], $item[18], $item[19]
                                , $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID_EXTRA', $item2.offsetbank)
                        }
                        if ($tmpdb -eq 0) {
                            $xyzc = [double]$tmpcr.ToString("0.00")
                            $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}"
                                , $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9]
                                , $count, $item[11], $item[12], $null, [math]::abs($xyzc.tostring("0.00")), $item[15], $item[16], $item[17], $item[18], $item[19]
                                , $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID_EXTRA', $item2.offsetbank)
                        }
                        $count++
                    }
                }
                else {
                    if ($item2.last -eq 'yes') {

                        $item[14] = [double]$chgCR        #.CreditAmount
                        $item[13] = [double]$chgDB        #.DebitAmount
                        $offsetbank = $item2.offsetbank
                    }
                }
            }
        }
    }
        
    $tmpcr = [double]$item[14]                     
    $tmpdb = [double]$item[13]                     
    $xyzd = 0
    $xyzc = 0
    if ($tmpcr -eq 0) { 
        $xyzd = [double]$tmpdb.ToString("0.00")
        $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
            $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
            $count, $item[11], $item[12], $xyzd.tostring("0.00"), $null, $item[15], $item[16], $item[17], $item[18], $item[19],
            $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)        
    }


    if ($tmpdb -eq 0) { 
        $xyzc = [double]$tmpcr.ToString("0.00")
        $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
            $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
            $count, $item[11], $item[12], $null , $xyzc.tostring("0.00"), $item[15], $item[16], $item[17], $item[18], $item[19],
            $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)       
    }

    if ($tmpdb -ne 0 -and $tmpcr -ne 0) { 
        $xyzc = [double]$tmpcr.ToString("0.00")
        $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
            $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
            $count, $item[11], $item[12], $null , $xyzc.tostring("0.00"), $item[15], $item[16], $item[17], $item[18], $item[19],
            $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)       
    }

    $count = $count + 1
    $lastjourn = $item[0]
}
}
catch {
    Write-Log -Message $_.Exception.Message -Path $scriptlog -Colorf Black -Colorb Yellow
}
finally {
    $reader.Close()
    $writer.Close()
}

Here is a log output showing timestamps for each section.

2022-06-13 13:47:01 INFO: -------------------------------------------------------------------
2022-06-13 13:47:01 INFO: Program Begins
2022-06-13 13:47:01 INFO: -------------------------------------------------------------------
2022-06-13 13:47:01 INFO: Loading C:\ICAccounts.csv into internal table
2022-06-13 13:47:01 INFO: Reading from C:\JAN21.end.CSV to accumulate totals
2022-06-13 13:47:01 INFO: Counting records in C:\JAN21.end.CSV
2022-06-13 13:47:03 INFO: record count is: 84853
2022-06-13 13:47:03 INFO: Reading from C:JAN21.end.CSV to accumulate totals
2022-06-13 13:50:34 INFO: Calculating Percentages in internal table
2022-06-13 13:50:35 INFO: Reading from C:\JAN21.end.CSV again to calculate totals...
2022-06-13 13:50:35 INFO: ...and writing records to C:\JAN21.end-06132022-134701.fix.csv
2022-06-13 13:52:12 INFO: Process Complete
2022-06-13 13:52:12 INFO: Script Elapsed Time (HH:MM:SS) 00:05:10

First loop took ~3.5 minutes.

Second loop took ~1.5 minutes.

EDIT2: @zett42, this sounds like an intriguing idea. However, I've never mixed C# and PS before. I tried to replace the string code in the first while loop with something like this, but the C# cannot see the PS variables. How can I get PS and C# to both see the strings and the array?

$id = get-random
$code = @"
using System;
namespace HelloWorld
{
    public class Program$ID
    {
        public static void Main(){

itco = item[2];
itkey = item[19];
if (item2.Type == "m2m" && itco != "LE0001")
{
    itco = "LE0004";
}
if (itkey == "109495")
{
    itkey = "109479";
}
if (itkey == "172574")
{
    itkey = "172697";
}
if (itkey == "172718")
{
    itkey = "172590";
}
thisrecord = itkey + itco;
if (item2.icac == "202884" || item2.icac == "103886")
{
}
if (thisrecord == item2.ICKey)
{
    ttot = (Double)ttot + (Double)item[14] - (Double)item[13];
    if (itkey == "202884" || itkey == "103886")
    {
    }
}
if (itkey == item2.offsetac)
{
    toff = (Double)toff + (Double)item[14] - (Double)item[13];
}
}
}
}

"@

I am calling this code with:

    while ($null -ne ($line = $reader.ReadLine())) {
        Add-Type -TypeDefinition $code -Language CSharp 
        iex "[HelloWorld.Program]::Main()"    
    }
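
A minimal sketch of what zett42 describes in the comments below: compile the C# once, before the loop, and pass the PowerShell values in as method parameters rather than expecting the C# code to see PS variables directly. The class and method names here (LineMath, Delta) are made up for illustration.

$source = @"
public static class LineMath
{
    // credit - debit for one CSV line; blank fields count as 0
    public static double Delta(string credit, string debit)
    {
        double c, d;
        double.TryParse(credit, out c);
        double.TryParse(debit, out d);
        return c - d;
    }
}
"@
Add-Type -TypeDefinition $source -Language CSharp   # compile once, outside the loop

# then, inside the existing while loop, something like:
#   $delta = [LineMath]::Delta($item[14], $item[13])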

Thanks, again.

toddtmw
  • Show us your code and some sample data :) – Mathias R. Jessen Jun 14 '22 at 14:00
  • For many 'simple' tasks PowerShell does a lot of work behind the scenes (e.g. generating objects) to make later processing much easier, but this can slow things down, increase memory, etc. For example, reading large files with `Get-Content` is known to scale poorly; [using streams](https://stackoverflow.com/a/47352340/9529842) is a better alternative. Without seeing your code, it's hard to say what might help. – boxdog Jun 14 '22 at 14:13
  • Can you use WSL? Or do you have access to a Linux box? Or can you upgrade from Windows to Linux? Or can you install a Windows version of `awk/gawk`? It will typically process multi-million line CSVs in seconds. – Mark Setchell Jun 14 '22 at 15:11
  • Unfortunately, no on all. This is being processed in a very locked-down environment, unfortunately. – toddtmw Jun 14 '22 at 15:24
  • There are a lot of "low level" string operations, comparisons and logic for each line of text. This is where PowerShell tends to be slowest compared to non-scripting languages. I suggest to rewrite the core stuff (the line-by-line processing) in C#. I've seen very big improvements for even less complex text-processing code, where the C# code already run faster by a factor of 10(!). You can embed C# code in PS scripts using `Add-Type -TypeDefinition $source` or `Add-Type -Path source.cs`. – zett42 Jun 14 '22 at 15:35
  • Create a class with a static function, pass the data as arguments and return the results (or modify reference parameters). I would probably do the whole `while ($null -ne ($line = $reader.ReadLine())) { }` loop in C#. – zett42 Jun 14 '22 at 16:23
  • @MarkSetchell [off-topic] `upgrade from Windows to Linux` - THAT's funny :) – Vlad Feinstein Jun 14 '22 at 17:48
  • You don't need to calculate ```$itkey``` for every iteration of the inner loop - if you give your variables meaningful names - e.g. instead of ```$item``` call it ```$outer_item```, and instead of ```$itkey``` call it ```$outer_item_key``` and it'll be a bit easier to see optimisations like this... – mclayton Jun 14 '22 at 19:28
  • The same with ```[Double]$item[14]``` and ```[Double]$item[13]``` - convert these in the outer loop rather than for every iteration of the inner loop. – mclayton Jun 14 '22 at 19:40
  • My thoughts 1. "so I have to process the whole thing each time" - can you refactor this code and cache the data? (a hashtable or such). Moving any code outside a big loop (let alone removing a nested loop) will make a noticeable difference. 2. Have you considered "chunking" your input file, processing the pieces in parallel (separate powershell.exe processes) then combining the results at the end? 3. `Import-Csv` produces big bloated objects; there are several good 3rd party dotnet csv libraries that can be driven from powershell, or even load your data into SQLite and query it in memory? – MisterSmith Jun 14 '22 at 20:45
  • @mclayton I'm actually doing some manipulation of $itkey that I removed from this code to make it simpler to read. there are three cases where if $itkey equals one value, I make it a different value. (there are two pairs of accounts that need to be treated as one.) But you bring up an excellent point about not having to do it every time. I can see now, that this is extremely wasteful. I will make that change and let you know how it goes. – toddtmw Jun 15 '22 at 20:19
  • @mclayton also a good suggestion. It isn't happening every loop, but it is happening at least twice for each input record (more when there are duplicates) so I will make that change and let you know how it goes. – toddtmw Jun 15 '22 at 20:20
  • @MisterSmith 1. It's weird. I started with a hash table and it actually seemed to be slower. I am going to move as much as I can outside of at least the inner loop based on other suggestions. 2. I have tried physically breaking up the files (like 10 days per file instead of 30) and then running three at a time on the server. That is "faster" from a wall-clock perspective, but it is a little more work to manage. But this is for a conversion, so it will not be something that will be done forever. 3. The environment I am working in is very locked down. I will see if I can get approval. – toddtmw Jun 15 '22 at 20:25
  • Unrelated to performance, but from a maintainability perspective I'd also maybe create constants for all of your array indices - e.g. ```$idx_company_reference_id = 2``` and then your code can read ```$itco = $item[$idx_company_reference_id]```. You'd need to measure if that slows things down though... – mclayton Jun 15 '22 at 20:25

1 Answer


The script, run against the data that used to take almost five minutes, now finishes in 18 seconds!!

The 8-hour data file now takes less than 14 minutes to run (and 2 minutes of that is just counting the records in the file so I can build a progress bar).

Before any changes:

Script Elapsed Time (HH:MM:SS) 00:04:54

After moving as much as I could outside the inner loop:

Script Elapsed Time (HH:MM:SS) 00:02:56

After using the hash table:

Script Elapsed Time (HH:MM:SS) 00:00:18

Thank you to @MisterSmith; their suggestion of a hash table made me rethink the loop, and I used a hash table to tell whether the loop was needed at all. That had the largest impact.

Also, thank you to @mclayton. The recommendation to move parts outside the inner loop by itself cut run time roughly in half. However, with the hash table the inner loop executes far less often, so that change's impact is somewhat reduced, but it was still an excellent suggestion and a reminder to do things as few times as you can get away with.
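
Roughly, that hoisting looked like this (a sketch, not the exact code; the 'm2m' company substitution still depends on $item2, so it has to stay inside the inner loop):

while ($null -ne ($line = $reader.ReadLine())) {
    $item  = $line.Split(",")
    $itkey = $item[19]                 # ExternalCode_Account, now computed once per line
    if ($itkey -eq '109495') { $itkey = '109479' }
    if ($itkey -eq '172574') { $itkey = '172697' }
    if ($itkey -eq '172718') { $itkey = '172590' }
    # convert the amounts once per line too; guard blank fields before casting
    $credit = if ($item[14]) { [Double]$item[14] } else { 0 }
    $debit  = if ($item[13]) { [Double]$item[13] } else { 0 }
    foreach ($item2 in $ICList) {
        # only the work that actually depends on $item2 stays in here
        # (the 'm2m' company substitution and the bucket updates)
    }
    $count++
}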

The code I added was:

Read the list into a hash table, separate from the table I was using (I know I could have just used the hash table for everything, but the table is small and this required fewer changes to the code I had already written):

$ICACList = @{}
Import-Csv -Path $Path2 | ForEach-Object { $ICACList[$_.'ICAC'] = $_ }
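
Indexing the hash table by account returns the matching row, or $null when the account isn't in the list, so the guard shown next is essentially free. A tiny illustration (123456 is the ICAC from the sample record above; 999999 is a made-up account that isn't in the list):

$ICACList['123456']   # the Import-Csv row whose ICAC is 123456
$ICACList['999999']   # $null, so `if ($ICACList[$itkey])` evaluates to false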

Then, I simply added an if statement before each of the inner foreach loops:

if ($icaclist[$itkey]) {
    foreach ($item2 in $ICList) {

It turns out that, with the mix of work in these files, there really are very few items that qualify for the inner loop, so skipping it for everything else had a huge impact.
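
For context, here is roughly how the guard sits around the original accumulation loop (abridged sketch, variable names as in the question):

while ($null -ne ($line = $reader.ReadLine())) {
    $item  = $line.Split(",")
    $itkey = $item[19]
    if ($ICACList[$itkey]) {            # O(1) hashtable probe
        foreach ($item2 in $ICList) {   # now only runs for the few qualifying accounts
            # ...original matching and totals accumulation...
        }
    }
    $count++
}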

Thank you to everyone who responded. It really was a combination of suggestions that helped me greatly improve this script and I learned a lot in the process.

toddtmw