
I have this code that works like a charm for small files. It dumps the whole file into memory, replaces NUL characters, and writes the result back to the same file. That's not practical for huge files whose size exceeds the available memory. Can someone help me convert it to a streaming model so it won't choke on huge files?

Get-ChildItem -Path "Drive:\my\folder\path" -Depth 2 -Filter *.csv |
Foreach-Object {
    $content = Get-Content $_.FullName
    # Replace NUL and save content back to the original file
    $content -replace "`0","" | Set-Content $_.FullName
}
APS
  • What research efforts have you undertaken thus far? – Bill_Stewart Apr 20 '21 at 17:27
  • What's up with the replace pattern? You are escaping the 0. If your intent is to replace the zero, it might not work. Let me know and I'll update my answer accordingly. – Steven Apr 20 '21 at 18:03
  • @Steven `` `0 `` in PowerShell is the ASCII 0 / NUL character in the CSV file that I am trying to replace with an empty string. – APS Apr 20 '21 at 22:06
  • I just thought of this, but it might be a better practice to try matching `\0` instead. In other RegEx flavors that will match a NUL, and typical advice in PowerShell is to use the RegEx metacharacters with operators like `-replace` & `-split` (see the sketch below). – Steven May 02 '21 at 16:42
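
For illustration (my addition, not from the thread), both escapes remove a NUL; `` `0 `` is resolved by PowerShell before the regex engine sees it, while `\0` is resolved by the regex engine itself:

$s = "a$([char]0)b"   # string containing an embedded NUL
$s -replace "`0", ''  # PowerShell escape -> 'ab'
$s -replace '\0', ''  # regex escape      -> 'ab'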

2 Answers


The way you have this structured, the entire file contents have to be read into memory. Note: reading a file into memory can use 3-4x the file size in RAM, as is documented here.

Without getting into .Net classes, particularly [System.IO.StreamReader], Get-Content is actually very memory efficient; you just have to leverage the pipeline so you don't build up the data in memory.

Note: if you do decide to try StreamReader, the article will give you some syntax clues. Moreover, that topic has been covered by many others on the web.

Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv |
ForEach-Object{
    $CurrentFile = $_
    $TmpFilePath = Join-Path $CurrentFile.Directory.FullName ($CurrentFile.BaseName + "_New" + $CurrentFile.Extension)

    # Stream line by line; only one line at a time is held in memory.
    # Note: Add-Content appends, so the temp file must not already exist.
    Get-Content $CurrentFile.FullName |
    ForEach-Object{ $_ -replace "`0","" } |
    Add-Content $TmpFilePath

    # Now that you've got the new file, delete the original and rename the copy:
    Remove-Item -Path $CurrentFile.FullName
    Rename-Item -Path $TmpFilePath -NewName $CurrentFile.Name
}

This is a streaming model: Get-Content streams inside the outer ForEach-Object loop. There may be other ways to do it, but I chose this approach so I could keep track of the names and do the file swap at the end.

Note: Per the same article, Get-Content is quite slow. However, your original code was likely already suffering that burden. You can speed it up a bit using the -ReadCount parameter, which sends some number of lines down the pipe at a time. That of course uses more memory, so you'd have to find a level that helps you stay within the boundaries of your available RAM. The performance improvement with -ReadCount is mentioned in this answer's comments.
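
For example, a batched version of the pipeline above might look like this (the 1000 is an arbitrary number for illustration; with -ReadCount, $_ is an array of lines, and -replace conveniently works element-wise on arrays):

    # Batched variant: each pipeline iteration handles 1000 lines at once.
    Get-Content $CurrentFile.FullName -ReadCount 1000 |
    ForEach-Object{ $_ -replace "`0","" } |
    Add-Content $TmpFilePath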

Update Based on Comments:

Here's an example of using StreamReader/Writer to perform the same operations as the previous example. This should be just as memory efficient as Get-Content, but much faster.

Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv | 
ForEach-Object{
    $CurrentFile = $_.FullName
    $CurrentName = $_.Name
    $TmpFilePath = Join-Path $_.Directory.FullName ($_.BaseName + "_New" + $_.Extension)
    
    $StreamReader = [System.IO.StreamReader]::new( $CurrentFile )
    $StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath )

    While( !$StreamReader.EndOfStream )
    {
        $StreamWriter.WriteLine( ($StreamReader.ReadLine() -replace "`0","") )
    }
    
    $StreamReader.Close()
    $StreamWriter.Close()

    # Now that you've got the new file, delete the original and rename the copy:
    Remove-Item -Path $CurrentFile
    Rename-Item -Path $TmpFilePath -NewName $CurrentName
} 
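
One defensive refinement you may want (my addition, not part of the answer's code): wrap the copy loop in try/finally so the file handles are released even if a read or write throws partway through.

    try
    {
        # Same loop as above; the finally block guarantees cleanup on error.
        While( !$StreamReader.EndOfStream )
        {
            $StreamWriter.WriteLine( ($StreamReader.ReadLine() -replace "`0","") )
        }
    }
    finally
    {
        $StreamReader.Close()
        $StreamWriter.Close()
    }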

Note: I have some sense this issue is rooted in encoding. The stream constructors do accept an encoding object as an argument.

Available Encodings:

[System.Text.Encoding]::BigEndianUnicode
[System.Text.Encoding]::Default
[System.Text.Encoding]::Unicode
[System.Text.Encoding]::UTF32
[System.Text.Encoding]::UTF7
[System.Text.Encoding]::UTF8

So if you wanted to instantiate the streams with, for example, UTF8:

    $StreamReader = [System.IO.StreamReader]::new( $CurrentFile, [System.Text.Encoding]::UTF8 )
    $StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath, [System.Text.Encoding]::UTF8 )

The streams do default to UTF8. I think the system default is typically code page Windows-1252.
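
If you want to confirm what your platform reports as the default (a quick check I'm adding for illustration; on Windows PowerShell [System.Text.Encoding]::Default is the active ANSI code page, while on PowerShell 7+ it is UTF8):

# Inspect the runtime's default encoding:
[System.Text.Encoding]::Default.EncodingName
[System.Text.Encoding]::Default.CodePage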

Steven
  • Thank you for your reply. I will explore the StreamReader since that appears to be the right way forward for very large files >= 50GB – APS Apr 20 '21 at 22:10
  • No pressure, and I only point it out because you're a new contributor, but if this answer helped get you to a solution and if you're comfortable with it, consider marking it answered using the checkmark to the left. – Steven Apr 22 '21 at 18:00

This is the simplest way to use the least memory: process one line at a time, writing to another file. But it needs double the disk space.

get-content file.txt | % { $_ -replace "`0" } | set-content file2.txt 
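
If encoding matters for your files, you can pin it on both ends; UTF8 here is just an assumption for illustration:

# Read and write with an explicit encoding (UTF8 assumed):
get-content file.txt -Encoding UTF8 | % { $_ -replace "`0" } | set-content file2.txt -Encoding UTF8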
js2010
  • Sorry to be critical, but I think this is included in [my answer](https://stackoverflow.com/a/67184066/4749264). I covered additional steps like renaming the files and also an alternate approach using Streams. – Steven Apr 21 '21 at 17:58