
Based on the suggestion from this thread, I have used PowerShell to do the UTF-8 conversion. Now I am running into another problem: I have a very large file, around 18 GB, which I am trying to convert on a machine with around 50 GB of free RAM, but the conversion process eats up all the RAM and the encoding fails. Is there a way to limit the RAM usage, or to do the conversion in chunks?

Using PowerShell to write a file in UTF-8 without the BOM

BTW, below is the exact code:

foreach ($file in ls -Name "$Path\CM*.csv")
{
    # Get-Content loads the entire file into memory here, which is what exhausts RAM on large files
    $file_content = Get-Content "$Path\$file"
    [System.IO.File]::WriteAllLines("$Path\$file", $file_content)

    echo "encoding done : $file"
}
depak jan
  • What's the original encoding? UTF8 w/BOM? UTF16LE/Unicode? – Mathias R. Jessen Apr 13 '21 at 11:50
  • Are you thinking of prepending the BOM? I hadn't thought of that but in any event would be selfishly interested in an example. – Steven Apr 13 '21 at 12:15
  • Original file UTF 8 with BOM – depak jan Apr 13 '21 at 13:38
  • Shouldn't you include `[System.Text.Encoding]::UTF8` in the command, like: `[System.IO.File]::WriteAllLines("$Path\$file", $file_content, [System.Text.Encoding]::UTF8)` I think otherwise it will be either ASCII or system default which is usually code page 1251. – Steven Apr 13 '21 at 14:20

3 Answers


Don't store the file's content in memory. As noted here, doing so requires 3-4 times the file size in RAM. Get-Content is slow but quite memory efficient, so a simple solution may be:

Get-Content -Path <InputFilePath> | Out-File -FilePath <OutputFilePath> -Encoding UTF8

Note: While I haven't tried this, you may want to use Add-Content instead of Out-File. The latter will sometimes reformat output according to console width; it is characteristic of the Out-* cmdlets that they traverse the for-display formatting system.

Because the content is streamed down the pipe, only one line at a time is held in RAM. The .NET garbage collector runs in the background, releasing and otherwise managing memory.
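If one line at a time proves too slow, here is a minimal sketch of the chunked variant discussed in the comments below (untested on a file this size; the placeholder paths are assumptions):

# -ReadCount 1000 sends arrays of 1000 lines down the pipeline instead of single lines,
# reducing per-object pipeline overhead while keeping memory usage bounded
Get-Content -Path <InputFilePath> -ReadCount 1000 |
    Out-File -FilePath <OutputFilePath> -Encoding UTF8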

Note: [System.IO.StreamReader] and [System.IO.StreamWriter] can probably also address this issue. They may be faster and are just as memory efficient, but they come with a syntax burden that may not be worth it, particularly if this is a one-off... That said, you can instantiate them with a [System.Text.Encoding] instance, so in theory you can use them for the conversion.

Steven
  • @MathiasR.Jessen Are there any specifics to doing so? Take a look at the article under the heading "Memory Considerations". I tested `-ReadCount 0` and a 100MB file needed 600 MB of RAM, granted that's reading into RAM. However, the description of `-ReadCount 0` suggests it will read the whole file before passing the chunk down the pipe. I did a quick re-test with `-ReadCount 1` to see if the meta-data stripping may improve performance but saw only marginal differences in either direction. All under 5.1. Perhaps playing with the argument sending different sized chunks down the pipe?? – Steven Apr 13 '21 at 12:11
  • No you're absolutely right, it's a terrible idea, my brain decided to write that comment before thinking it all the way through :-) – Mathias R. Jessen Apr 13 '21 at 13:01
  • Thanks for the response, but this command is taking a lot of time and producing an empty file for some reason – depak jan Apr 13 '21 at 13:38
  • It is not surprising it would take a long time. I'm not sure why it would produce an empty file. You aren't using the same file path on both sides of the pipe are you? I'm testing now and will let you know. Needless to say it took a bit to accumulate an 18GB file. – Steven Apr 13 '21 at 14:15
  • I tested with both `Add-Content` & `Out-File`; they were both slow, the latter was worse, ~60 minutes. However, in both cases the resulting file seemed fine. In both cases I saw more memory utilization than expected, but I only have ~3GB free (too many Chrome tabs...) and neither approach ran out of memory or otherwise failed. That said, I mentioned and @Theo demonstrated the StreamReader/Writer approach. It's likely to be much faster than native cmdlets. – Steven Apr 13 '21 at 16:18
  • Note: `Out-File` was significantly faster when chunking down the pipeline with `-ReadCount 1000`: 30 minutes versus 60. – Steven Apr 13 '21 at 17:13

When you know that the input file is always UTF-8 with BOM, you only need to strip the first three bytes (the BOM) from the file.

Using a buffered stream, you only need to load a fraction of the file into memory.

For best performance I would use a FileStream. This is a raw binary stream and thus has the least overhead.

$streamIn = $streamOut = $null
try {
    $streamIn = [IO.FileStream]::new( $fullPathToInputFile, [IO.FileMode]::Open )
    $streamOut = [IO.FileStream]::new( $fullPathToOutputFile, [IO.FileMode]::Create )

    # Strip 3 bytes (the UTF-8 BOM) from the input file
    $null = $streamIn.Seek( 3, [IO.SeekOrigin]::Begin )

    # Copy the remaining bytes to the output file
    $streamIn.CopyTo( $streamOut )

    # You may try a custom buffer size for better performance:
    # $streamIn.CopyTo( $streamOut, 1MB )
}
finally {
    # Make sure to close the files even in case of an exception
    if( $streamIn ) { $streamIn.Close() }
    if( $streamOut ) { $streamOut.Close() }
}

You may experiment with the FileStream.CopyTo() overload that has a bufferSize parameter. In my experience, a larger buffer size (say 1 MiB) can improve performance considerably, but when it is too large, performance will suffer again because of bad cache use.
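If you cannot be certain that every input file actually starts with a BOM, here is a minimal sketch of a guard that could replace the unconditional Seek() above (the $bom variable is introduced for illustration only, it is not part of the original answer):

# Read the first 3 bytes and skip them only if they are the UTF-8 BOM (0xEF 0xBB 0xBF)
$bom = [byte[]]::new( 3 )
$bytesRead = $streamIn.Read( $bom, 0, 3 )
if( -not ($bytesRead -eq 3 -and $bom[0] -eq 0xEF -and $bom[1] -eq 0xBB -and $bom[2] -eq 0xBF) ) {
    # No BOM found - rewind so the first bytes are copied too
    $null = $streamIn.Seek( 0, [IO.SeekOrigin]::Begin )
}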

zett42
  • Nice! If you look at my other comments I was waiting for someone to demo that approach. Thanks! – Steven Apr 13 '21 at 18:03

You can use a StreamReader and StreamWriter to do the conversion.

The StreamWriter by default outputs UTF8NoBOM.

This will involve a lot of disk I/O, but it will be lean on memory.

Bear in mind that .NET needs full absolute paths.

$sourceFile      = 'D:\Test\Blah.txt'  # enter your own in- and output files here
$destinationFile = 'D:\Test\out.txt'

$reader = [System.IO.StreamReader]::new($sourceFile, [System.Text.Encoding]::UTF8)
$writer = [System.IO.StreamWriter]::new($destinationFile)

while ($null -ne ($line = $reader.ReadLine())) {
    $writer.WriteLine($line)
}
# clean up
$writer.Flush()
$reader.Dispose()
$writer.Dispose()

The above code will add a final newline to the output file. If that is unwanted, do this instead:

$sourceFile      = 'D:\Test\Blah.txt'
$destinationFile = 'D:\Test\out.txt'

$reader = [System.IO.StreamReader]::new($sourceFile, [System.Text.Encoding]::UTF8)
$writer = [System.IO.StreamWriter]::new($destinationFile)

while ($null -ne ($line = $reader.ReadLine())) {
    if ($reader.EndOfStream) {
        $writer.Write($line)
    }
    else {
        $writer.WriteLine($line)
    }
}
# clean up
$writer.Flush()
$reader.Dispose()
$writer.Dispose()
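Not part of the original answer, but a hedged sketch of how this approach might be wrapped around the CM*.csv files from the question, writing to a temporary file and then replacing the original (the temp-file handling and variable names are assumptions):

foreach ($file in Get-ChildItem -Path "$Path\CM*.csv") {
    $tempFile = "$($file.FullName).tmp"

    $reader = [System.IO.StreamReader]::new($file.FullName, [System.Text.Encoding]::UTF8)
    $writer = [System.IO.StreamWriter]::new($tempFile)   # UTF-8 without BOM by default

    while ($null -ne ($line = $reader.ReadLine())) {
        $writer.WriteLine($line)
    }
    $reader.Dispose()
    $writer.Dispose()

    # Replace the original file with the converted copy
    Move-Item -Path $tempFile -Destination $file.FullName -Force
    Write-Host "encoding done : $($file.Name)"
}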
Theo
  • Am I wrong that the `.Close()` calls `.Dispose()` and `.Flush()` under the hood? Based on documentation I always thought so. – Steven Apr 13 '21 at 13:59
  • @Steven Never really could figure that out exactly from [the docs](https://learn.microsoft.com/en-us/dotnet/api/system.io.streamwriter.dispose?view=net-5.0). I always thought `.Dispose()` closes the underlying stream and releases the unmanaged resources. As for `.Flush()`, it says _Causes any buffered data to be written to the underlying stream_, so in that regard, the line `$writer.Flush()` could be superfluous. Don't think it will hurt though.. – Theo Apr 13 '21 at 14:05
  • I do agree the caution never hurts. Thanks for the input! – Steven Apr 13 '21 at 14:13
  • Thank you, this method worked great, it does not consume any memory at all!!! – depak jan Apr 13 '21 at 17:31