
I have a working script in PowerShell:

$file = Get-Content -Path HKEY_USERS.txt -Raw

foreach($line in [System.IO.File]::ReadLines("EXCLUDE_HKEY_USERS.txt"))
{
    $escapedLine = [Regex]::Escape($line)
    $pattern = $("(?sm)^$escapedLine.*?(?=^\[HKEY)")
    
    $file -replace $pattern, ' ' | Set-Content HKEY_USERS-filtered.txt
    $file = Get-Content -Path HKEY_USERS-filtered.txt -Raw
}

For each line in EXCLUDE_HKEY_USERS.txt it performs some changes in the file HKEY_USERS.txt. So with every loop iteration it writes to this file and re-reads the same file to pick up the changes. However, Get-Content is notorious for memory leaks, so I wanted to refactor the script to use StreamReader and StreamWriter, but I'm having a hard time making it work.

As soon as I do:

$filePath = 'HKEY_USERS-filtered.txt';
$sr = New-Object IO.StreamReader($filePath);
$sw = New-Object IO.StreamWriter($filePath);

I get:

New-Object : Exception calling ".ctor" with "1" argument(s): "The process cannot access the file 
'HKEY_USERS-filtered.txt' because it is being used by another process."

So it looks like I cannot use StreamReader and StreamWriter on same file simultaneously. Or can I?

van_folmert
    Note that `[System.IO.File]::ReadLines( )` is the way to go in this case. `Get-Content` without the `-Raw` switch doesn't lead to memory leaks; it just adds honestly unneeded ETS properties to each line (each object of the file), which is why it's slow. – Santiago Squarzon Mar 20 '22 at 01:54
    Adding to my previous comment, there is no performance difference between `[System.IO.File]::ReadAllText( )` and `Get-Content -Raw` – Santiago Squarzon Mar 20 '22 at 02:07
    As an aside: It's best to avoid pseudo-method syntax: Instead of `New-Object SomeType(arg1, ...)`, use `New-Object SomeType [-ArgumentList] arg1, ...` - PowerShell cmdlets, scripts and functions are invoked like _shell commands_, not like _methods_. That is, no parentheses around the argument list, and _whitespace_-separated arguments (`,` constructs an _array_ as a _single argument_, as needed for `-ArgumentList`). However, method syntax _is_ required if you use the PSv5+ `[SomeType]::new()` constructor-call method. See [this answer](https://stackoverflow.com/a/50636061/45375) – mklement0 Mar 20 '22 at 02:32
  • @SantiagoSquarzon `ReadLines` indeed consumes much less memory than `Get-Content`, but it has a problem with [regexes](https://stackoverflow.com/questions/71544930/cannot-remove-text-between-two-strings-with-readlines). – van_folmert Mar 20 '22 at 07:50

1 Answer


tl;dr

  • Get-Content -Raw reads a file as a whole into a single string; it is fast and incurs little extra memory overhead.

  • [System.IO.File]::ReadLines() is a faster and more memory-efficient alternative to line-by-line reading with Get-Content (without -Raw), but you need to ensure that the input file is passed as a full path, because .NET's working directory usually differs from PowerShell's.

    • Convert-Path resolves a given relative path to a full, file-system-native one.

    • A PowerShell-native alternative to using [System.IO.File]::ReadLines() is the switch statement with the -File parameter, which performs similarly well while avoiding the working-directory discrepancy pitfall, and offers additional features.

  • There is no need to save the modified file content to disk after each iteration - just update the $file variable, and, after exiting the loop, save the value of $file to the output file.

$fileContent = Get-Content -Path HKEY_USERS.txt -Raw

# Be sure to specify a *full* path.
$excludeFile = Convert-Path -LiteralPath 'EXCLUDE_HKEY_USERS.txt'

foreach($line in [System.IO.File]::ReadLines($excludeFile)) {
    $escapedLine = [Regex]::Escape($line)
    $pattern = "(?sm)^$escapedLine.*?(?=^\[HKEY)"
    # Modify the content and save the result back to variable $fileContent
    $fileContent = $fileContent -replace $pattern, ' '
}

# After all modifications have been performed, save to the output file
$fileContent | Set-Content HKEY_USERS-filtered.txt
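
The PowerShell-native `switch -File` alternative mentioned above could look something like the following sketch (file names taken from the question; otherwise assumed, not from the original answer):

```powershell
$fileContent = Get-Content -Path HKEY_USERS.txt -Raw

# switch -File streams the exclusion file line by line, resolving the
# relative path against PowerShell's current location (no Convert-Path needed).
switch -File 'EXCLUDE_HKEY_USERS.txt' {
    default {
        # $_ is the current line of the exclusion file.
        $pattern = "(?sm)^$([Regex]::Escape($_)).*?(?=^\[HKEY)"
        $fileContent = $fileContent -replace $pattern, ' '
    }
}

$fileContent | Set-Content HKEY_USERS-filtered.txt
```

The `default` branch matches every line, so the body runs once per line, just like the `foreach` loop over `[System.IO.File]::ReadLines()`.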

Building on Santiago Squarzon's helpful comments:

  • Get-Content does not cause memory leaks, but it can consume a lot of memory that isn't garbage-collected until an unpredictable later point in time.
    • The reason is that - unless the -Raw switch is used - it decorates each line read with PowerShell ETS (Extended Type System) properties containing metadata about the file of origin, such as its path (.PSPath) and the line number (.ReadCount).
    • This both consumes extra memory and slows the command down - GitHub issue #7537 asks for a way to opt out of this wasteful decoration, as it typically isn't needed.
    • However, reading with -Raw is efficient, because the entire file content is read into a single, multi-line string, which means that the decoration is only performed once.
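
To see the decoration in action, inspect a single line emitted by `Get-Content` without `-Raw` (a quick illustration, assuming HKEY_USERS.txt exists):

```powershell
# Each line emitted by Get-Content (without -Raw) is a [string] decorated
# with ETS NoteProperties describing the file of origin.
Get-Content -Path HKEY_USERS.txt -TotalCount 1 |
  Get-Member -MemberType NoteProperty
# Expect properties such as PSPath, PSParentPath, PSChildName,
# PSDrive, PSProvider and ReadCount.
```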

So it looks like I cannot use StreamReader and StreamWriter on same file simultaneously. Or can I?

No, you cannot. You cannot simultaneously read from a file and overwrite it.

To update / replace an existing file you have two options (note that, for a fully robust solution, all attributes of the original file (except the last write time and size) should be retained, which requires extra work):

  • Read the old content into memory in full, perform the desired modification in memory, then write the modified content back to the original file, as shown in the top section.

    • There is a slight risk of data loss, however, namely if the process of writing back to the file gets interrupted.
  • More safely, write the modified content to a temporary file and, upon successful completion, replace the original file with the temporary one.
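
The safer temporary-file approach could be sketched as follows (variable names and the use of `GetTempFileName()` are illustrative assumptions, not from the original answer):

```powershell
# Write the modified content to a temporary file first, and replace the
# original only after the write has fully succeeded.
$target   = Convert-Path -LiteralPath 'HKEY_USERS-filtered.txt'
$tempFile = [System.IO.Path]::GetTempFileName()

try {
    $fileContent | Set-Content -LiteralPath $tempFile
    # Replace the original with the fully written temporary file.
    Move-Item -LiteralPath $tempFile -Destination $target -Force
}
finally {
    # If the replacement didn't happen, clean up the temporary file.
    if (Test-Path -LiteralPath $tempFile) {
        Remove-Item -LiteralPath $tempFile
    }
}
```

Note that this sketch does not preserve the original file's attributes or ACLs; as the answer points out, a fully robust solution requires extra work to retain them.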

mklement0
  • For me `Get-Content` wastes 4+ GB of RAM (even with your code optimization), whereas `[System.IO.File]::ReadLines` at peak took only 1.5 GB RAM and memory was freed from PowerShell ISE process once the script stopped, which wasn't the case with `Get-Content`. – van_folmert Mar 20 '22 at 06:16
    @van_folmert, you can't compare the two because `Get-Content -Raw` reads the _entire file_ into a single string as `[System.IO.File]::ReadAllText()` would - which you need, since you want to match across line boundaries. The memory isn't reclaimed until some time after the `$fileContent` variable goes out of scope or is manually removed. By contrast, `[System.IO.File]::ReadLines()` reads _line by line_, _lazily_ (as `Get-Content` without `-Raw` does in the pipeline, with much more overhead), and in a loop the strings created may become eligible for garbage collection after each iteration. – mklement0 Mar 20 '22 at 10:38