I am doing some file cleanup before loading into my data warehouse and have run into a file-size issue:

(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt

My file is about 2GB. I am receiving the following error and am not sure how to correct it.

Get-Content : Exception of type 'System.OutOfMemoryException' was thrown, ........

I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)

  • don't load the whole file ... load it line-by-line and do your replace on each line. then use `Add-Content` to send the result to a new file. – Lee_Dailey Nov 20 '20 at 21:26

3 Answers


Get-Content -Raw makes PowerShell read the entire file into a single string.

.NET can't store individual objects larger than 2GB in memory, and each character in a string takes up 2 bytes, so after reading roughly 1 billion characters (about the contents of a 1GB ASCII-encoded text file), it hits that limit.
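To get a feel for the numbers, here's a rough sketch (using the file from the question) that compares the on-disk size with the approximate in-memory size of the single string -Raw would build:

# Each [char] in a .NET string is a 2-byte UTF-16 code unit, so an
# ASCII text file roughly doubles in size when held as one string.
$fileSizeBytes = (Get-Item -Path C:\Workspace\workfile\myfile.txt).Length
'On disk      : {0:N0} bytes' -f $fileSizeBytes
'As one string: ~{0:N0} bytes' -f ($fileSizeBytes * 2)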

Remove the -Raw switch; -replace is perfectly capable of operating on multiple input strings at once:

(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt

Beware that -replace is a regex operator; to match a literal \ you need to escape it in the pattern:

(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt

While this will work, it's still slow: the (...) grouping means PowerShell still loads all >2GB of data into memory (as an array of lines) before -replace is applied and the output is written.

Instead, you might want to pipe the output from Get-Content to the ForEach-Object cmdlet:

Get-Content -path C:\Workspace\workfile\myfile.txt | ForEach-Object {
  $_ -replace '\\"','"'
} | Set-Content C:\Workspace\workfile\myfileCLEAN.txt

This allows Get-Content to start pushing output before it has finished reading the file, so PowerShell no longer needs to hold the entire file in memory at once, resulting in faster execution.
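If you want to verify the difference on your own machine, Measure-Command can time either variant; here's a sketch timing the streaming pipeline (paths are the ones from the question):

# Measure-Command returns a TimeSpan for the wrapped script block
$elapsed = Measure-Command {
  Get-Content -path C:\Workspace\workfile\myfile.txt | ForEach-Object {
    $_ -replace '\\"','"'
  } | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
}
'Streaming pipeline took {0:N1} seconds' -f $elapsed.TotalSeconds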

Mathias R. Jessen
  • This does work but is slow, with a 2GB file taking almost an hour to process. Any hints on helping to speed up the process? – Susan Melton Nov 30 '20 at 21:59
  • A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch statement with the -File parameter.

  • A performant way of writing a text file is to use a System.IO.StreamWriter instance.

  • As Mathias points out in his answer, using verbatim \" with the regex-based -replace operator actually replaces " alone, due to the escaping rules of regexes. While you could address that with '\\"', in this case a simpler and better-performing alternative is to use the [string] type's Replace() method, which operates on literal substrings.

To put it all together:

# Note: Be sure to use a *full* path, because .NET's working dir. usually
#       differs from PowerShell's.
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')

switch -File C:\Workspace\workfile\myfile.txt {
  default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
}

$streamWriter.Close()

Note: If you're using an old version of Windows PowerShell, namely version 4 or below (where the ::new() method syntax isn't available), use
New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
instead of
[System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')

mklement0
  • Thank you for pointing that out. I upgraded to v5 and wow! this works great and is super fast :) Thank you so much! – Susan Melton Dec 01 '20 at 16:38
  • @mklement0 - I'm working on some comparisons and will write a proper question soon, but is `switch` w/ `-File` the most performant way of reading? Interesting we aren't doing a `StreamReader` to complement the writer. I need to use `ConvertFrom-Csv` in my case, so I'm still working on figuring out what will be the best to measure, but thought I'd ask while I'm working on it – immobile2 Apr 05 '22 at 09:49
  • @immobile2, among native PowerShell features, `switch -File` is the fastest method. I'm not certain, but I think its performance is comparable to a `StreamReader` loop or a `[IO.File]::ReadLines()` loop (a sketch of such a loop follows this thread). The use of .NET methods is less obvious to PowerShell users, and there's the pitfall of the working-directory mismatch. Also, `switch` offers sophisticated filtering functionality (not needed here). – mklement0 Apr 05 '22 at 12:47
  • Thanks! It is actually slower than my piping at the moment, but I think that is the `StreamWriter` for some reason. I'm going to go with an `[IO.File]` method though because my `switch` statement right now is needlessly doing a wildcard compare to identify the header and then stripping the header off later before writing to file – immobile2 Apr 06 '22 at 17:24
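For reference, here's a minimal sketch of the StreamReader loop mentioned in the comments above, assuming PowerShell v5+ (for ::new()) and the paths from the question; the same full-path caveat applies as with the StreamWriter:

$streamReader = [System.IO.StreamReader]::new('C:\Workspace\workfile\myfile.txt')
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')

# ReadLine() returns $null at end of file, which ends the loop.
while ($null -ne ($line = $streamReader.ReadLine())) {
    $streamWriter.WriteLine($line.Replace('\"', '"'))
}

$streamReader.Close()
$streamWriter.Close()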

Get-Content loads the whole file into memory.

Try processing line by line to improve memory utilization.

$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"

foreach ($line in [System.IO.File]::ReadLines($infile)) {
    Add-Content -Path $outfile -Value ($line -replace '\\"','"')
}
Philip Wrage
  • `Get-Content` only loads the whole file into memory if you use the `-Raw` switch or enclose the call in `(...)`. While `[System.IO.File]::ReadLines()` is an efficient way to read a file, you're negating the performance benefits by using `Add-Content` in a loop, which opens and closes the output file in each iteration (see the sketch after this thread). – mklement0 Nov 20 '20 at 23:37
  • This works but is as slow as using Get-Content. The only advantage I saw was that it was less taxing on memory and I could more easily monitor the progress. – Susan Melton Nov 30 '20 at 22:02
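A sketch of the fix mklement0's comment suggests: keep [System.IO.File]::ReadLines() for the memory-friendly reading, but open the output file once via a StreamWriter instead of re-opening it with Add-Content on every line:

$infile  = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"

$writer = [System.IO.StreamWriter]::new($outfile)
foreach ($line in [System.IO.File]::ReadLines($infile)) {
    # .Replace() treats its arguments as literal text, so no regex escaping is needed
    $writer.WriteLine($line.Replace('\"', '"'))
}
$writer.Close()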