1

I am writing a simple script (or so I thought) to replace some strings in CSV files. Those strings are so-called "keys" of objects. I basically replace the "old key" in the files with a "new key".

function simpleStringReplacement {
    param (
        $sourceFiles,  # list of csv files in which we do need to replace contents
        $mappingList,  # a file that contains 2 columns: The old key and the new key
        $exportFolder, # folder where i expect the results
        $FieldsToSelectFromTargetFilesIntoMappingFile # As the names of the fields that contain the values for replacements change, i have that in this array
    )
    $totalitems = $sourceFiles.count
    $currentrow = 0
    Write-Output "Importing mapper file $mappingList" | logText
    $findReplaceList = Import-Csv -Path $mappingList -Delimiter   ';'
    foreach ($sourceFile in $sourceFiles) {
        $currentrow += 1
        Write-Output "Working on  $currentrow : $sourceFile" | logText
        [string] $txtsourceFile = Get-Content $sourceFile.FullName | Out-String
        $IssueKey = $FieldsToSelectFromTargetFilesIntoMappingFile[0]
        $OldIssueKey = $FieldsToSelectFromTargetFilesIntoMappingFile[1]

        ForEach ($findReplaceItem in $findReplaceList) {
          $txtsourceFile = $txtsourceFile -replace  $findReplaceitem.$OldIssueKey , $findReplaceitem.$IssueKey
        }
        $outputFileName = $sourceFile.Name.Substring(0, $sourceFile.Name.IndexOf('.csv') ) + "_newIDs.csv"
        $outputFullFileName = Join-Path -Path $exportFolder -ChildPath $outputFileName
        Write-Output "Writing result to  $currentrow : $outputFullFileName" | logText
        $txtsourceFile | Set-Content -path $outputFullFileName
    }
}

The issue I have: already while the script is working on the first file (the first iteration of the outer loop) I get:

Insufficient memory to continue the execution of the program.

And this error is referencing my code line with the replacement:

$txtsourceFile = $txtsourceFile -replace  $findReplaceitem.$OldIssueKey , $findReplaceitem.$IssueKey

The CSV files are "big" but really not that big: the mapping list is 1.7 MB and each source file is around 1.5 MB.

I can't really understand how I run into memory issues with these file sizes, and of course I have no idea how to avoid the problem.

I found some blog posts about memory issues in PowerShell. They all end up changing the PowerShell MaxMemoryPerShellMB quota defaults. That doesn't work for me at all, as I run into an error with

get-item WSMAN:\localhost\shell\MaxMemoryPerShellMB

Saying "get-item : Cannot find path 'WSMan:\localhost\Shell\MaxMemorPerShellMB' because it does not exist."

I am working in VS Code.

EKortz
  • short update: If I check the system memory consumption during execution, the Windows PowerShell process takes up to 3.2 GB before it is stopped with the exception. – EKortz Oct 30 '19 at 18:28
  • How many issue keys might there be in the `$mappingList` file? And for a given `$sourceFile` how many of its keys might be remapped? Though both files are less than a mere 2 MB, every time the error line you referenced results in a change it will produce a slightly different but still entirely new `[String]` object representing the complete source file. If you have, say, 10,000 mappings defined and 1,000 of them are found in the source file, that's 1,000 × 1.7 MB = 1.7 GB of garbage to collect. The math gets worse if the mappings are shorter but greater in number. – Lance U. Matthews Oct 30 '19 at 19:38
  • @BACON is suggesting the same thing I was thinking, but I don't know enough about gc in PowerShell. Are you sure it was only in the copy? The same misspelling is in the error message? – Mark Oct 30 '19 at 19:40
  • Also, when you say you're replacing "keys", are you remapping entire column (cell) values, or is it arbitrary search text that could be a substring of a value (like a profanity filter)? This could be processed line-by-line using a `[Hashtable]`/`[Dictionary]` to perform the mappings, which should greatly reduce the run-time as well as memory usage, but it would require that entire values are being replaced. – Lance U. Matthews Oct 30 '19 at 19:57
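A minimal, hypothetical sketch of the dictionary idea from the last comment (it assumes the source files are ';'-delimited and that the keys always occupy whole cells; $findReplaceList, $OldIssueKey, $IssueKey, $sourceFile, and $outputFullFileName are the variables from the script above):

# Build the old-key -> new-key lookup once
$keyMap = @{}
foreach ($row in $findReplaceList) {
    $keyMap[$row.$OldIssueKey] = $row.$IssueKey
}

# Stream the source file line by line and remap whole cell values via the hashtable
Get-Content $sourceFile.FullName | ForEach-Object {
    $cells = $_ -split ';'
    for ($i = 0; $i -lt $cells.Count; $i++) {
        if ($keyMap.ContainsKey($cells[$i])) {
            $cells[$i] = $keyMap[$cells[$i]]   # O(1) lookup instead of thousands of -replace calls
        }
    }
    $cells -join ';'
} | Set-Content -Path $outputFullFileName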

2 Answers

1

As @BACON alludes, the core issue here is caused by looping through (likely) several thousand replacements.

Every time the replacement line executes:

$txtsourceFile = $txtsourceFile -replace  $findReplaceitem.$OldIssueKey , $findReplaceitem.$IssueKey

PowerShell already holds one chunk of memory for $txtsourceFile, and it then allocates a new chunk of memory to store a copy of the data with the text replaced.

This is normally "ok": you end up with one valid chunk of memory holding the replaced text and an "invalid" copy holding the original text. Most machines have (relatively) lots of memory, and .NET normally handles this "leaking" by periodically running a garbage collector in the background to clean up the invalid data.

The trouble is that when we loop several thousand times in rapid succession, we also generate several thousand copies of the data in rapid succession. You eventually run out of available free memory (here at around 3.2 GB) before the Garbage Collector has a chance to run and clean up the thousands of invalid copies. See: No garbage collection while PowerShell pipeline is executing
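To make the allocation pattern visible in isolation, here is a small illustrative snippet (not part of the original script):

# -replace never modifies a string in place; it returns a new string whenever it matches
$original = 'ABC-001;some text;ABC-002'
$replaced = $original -replace 'ABC-001', 'XYZ-001'

[object]::ReferenceEquals($original, $replaced)   # False - a brand-new string object was allocated

With a ~1.5 MB source string and tens of thousands of mappings, every matching replacement allocates another full-size copy, which is where the gigabytes of short-lived garbage come from.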

There are a couple of ways to work around this:

Solution 1: The Big, Slow, and Inefficient Way

If you need to work with the whole file (i.e. across newlines) you can use the same code and manually run the Garbage Collector periodically during the execution to manage the memory "better":

$count = 0

ForEach ($findReplaceItem in $findReplaceList) {
    $txtsourceFile = $txtsourceFile -replace  $findReplaceitem.$OldIssueKey, $findReplaceitem.$IssueKey

    # Every 200 replacements, stop and force a full garbage collection so the
    # discarded copies of the file text are reclaimed
    if (($count % 200) -eq 0)
    {
        [System.GC]::GetTotalMemory('forceFullCollection') | out-null
    }
    $count++
}

This does two things:

  1. Runs the Garbage Collection every 200 loops ($count modulo 200).
  2. Stops the current execution and forces the collection.

Note:

Normally you use:

[GC]::Collect()

But according to Addressing the PowerShell Garbage Collection bug at J House Consulting this doesn't always work when trying to force the collection inside a loop. Using:

[System.GC]::GetTotalMemory('forceFullCollection')

Fully stops execution until the Garbage collection is complete before resuming.
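If you prefer to keep the loop body tidy, the same call can be wrapped in a small helper (the function name here is just an example; passing $true has the same effect as the 'forceFullCollection' string, since both end up as the method's boolean forceFullCollection parameter):

# Example convenience wrapper: blocks until a full garbage collection has finished
function Invoke-FullGC {
    [System.GC]::GetTotalMemory($true) | Out-Null
}

# Usage inside the loop:
if (($count % 200) -eq 0) { Invoke-FullGC }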

Solution 2: The Faster, More Memory-Efficient Way (One Line at a Time)

If you can perform all the replacements one line at a time, then you can use a [System.IO.StreamReader] to stream the file in and process it one line at a time, and a [System.IO.StreamWriter] to write it out.

try
{
    $SR = New-Object -TypeName System.IO.StreamReader -ArgumentList $sourceFile.FullName
    $SW = [System.IO.StreamWriter] $outputFullFileName

    # ReadLine() returns $null at end of file; the explicit check keeps blank lines from ending the loop early
    while (($line = $SR.ReadLine()) -ne $null) {
        #Loop through Replacements, accumulating the changes in $line
        ForEach ($findReplaceItem in $findReplaceList) {
            $line = $line -replace  $findReplaceitem.$OldIssueKey, $findReplaceitem.$IssueKey
        }
        $SW.WriteLine($line)
    }

    $SR.Close() | Out-Null
    $SW.Close() | Out-Null
}
finally
{
    #Cleanup
    if ($SR -ne $null)
    {
        $SR.dispose()
    }
    if ($SW -ne $null)
    {
        $SW.dispose()
    }
}

This should run an order of magnitude faster because you will be working on one line at a time and won't be creating thousands of copies of the entire file with every replacement.

HAL9256
0

I found the answer and comments above very helpful and implemented a solution that is close to the answer here: I split the $findReplaceList into multiple batches (it is around 37,000 entries long; I started with batches of 1,000) and work on it batch by batch with a forced GC in between. Now I can watch the memory usage climb during a batch and drop back down again when one is done.
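A rough sketch of what that batching looks like (same variable names as the original script; the slicing logic is just one illustrative way to do it):

$batchSize = 1000
for ($offset = 0; $offset -lt $findReplaceList.Count; $offset += $batchSize) {
    # Take the next slice of (up to) 1,000 mapping rows
    $lastIndex = [Math]::Min($offset + $batchSize, $findReplaceList.Count) - 1
    foreach ($findReplaceItem in $findReplaceList[$offset..$lastIndex]) {
        $txtsourceFile = $txtsourceFile -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey
    }

    # Reclaim the discarded string copies before starting the next batch
    [System.GC]::GetTotalMemory('forceFullCollection') | Out-Null
}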

With that I found an interesting behavior: the memory issue still came up in a few of the batches... So I analysed the $findReplaceList further, with the following result:

There are cases where there is NO $OldIssueKey value in the mapping file.

Can it be that PS then sees that as an empty string and tries to replace all of those?

EKortz
  • That produces a very interesting result! Matching on an empty string: `"abc" -replace "","z"` returns `zazbzcz`. It looks like it matches at every single character position (including the end of the string) and replaces it with the replacement text plus the existing character. So if you have a huge file, it would definitely run into additional memory issues if it matches every character. – HAL9256 Nov 01 '19 at 18:10
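Given that behavior, one simple guard is to drop any mapping rows with a blank old key before the replacement loop runs. A minimal sketch using the variable names from the question:

# Skip mapping rows whose old-key column is empty or whitespace, so -replace is never
# called with an empty pattern (which matches at every character position)
$findReplaceList = $findReplaceList | Where-Object {
    -not [string]::IsNullOrWhiteSpace($_.$OldIssueKey)
}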