
I have some text files with different encodings: some are UTF-8 and others are Windows-1251 encoded. I tried to run the following recursive script to convert them all to UTF-8.

Get-ChildItem *.nfo -Recurse | ForEach-Object {
  $content = $_ | Get-Content
  Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}

Afterwards I can no longer use the files in my Java program: the files that were already UTF-8 encoded now have the wrong encoding too, and I can't get the original text back. For the Windows-1251 files I get empty output, just as with the original files. So the script also corrupts the files that were already UTF-8 encoded.

I found another possible solution, iconv, but as far as I can see it needs the current encoding as a parameter:

$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile 

Files with different encodings are mixed throughout a folder structure, so each file should stay at its current path.

The system uses code page 852. The existing UTF-8 files have no BOM.

plaidshirt

1 Answer


In Windows PowerShell you won't be able to use the built-in cmdlets for two reasons:

  • From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.

  • Using Set-Content (and similar) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.

    • Update: There is a workaround: the New-Item cmdlet, when combined with the -Value parameter, (surprisingly) does create BOM-less UTF-8 files - see this answer and the sketch right after this list.
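To make the BOM difference visible, here is a minimal sketch for Windows PowerShell 5.1 (the file names demo-setcontent.txt and demo-newitem.txt are made up for illustration) that writes the same ASCII string with both cmdlets and then dumps the first three bytes of each file; EF BB BF at the start is the UTF-8 BOM:

# Windows PowerShell 5.1; file names are placeholders.
'hello' | Set-Content -Encoding UTF8 .\demo-setcontent.txt
$null = New-Item -Force .\demo-newitem.txt -Value 'hello'

# Dump the first 3 bytes of each file as hex; EF BB BF is the UTF-8 BOM.
[IO.File]::ReadAllBytes("$PWD\demo-setcontent.txt")[0..2] | ForEach-Object { '{0:X2}' -f $_ }  # EF BB BF
[IO.File]::ReadAllBytes("$PWD\demo-newitem.txt")[0..2]    | ForEach-Object { '{0:X2}' -f $_ }  # 68 65 6C ('hel')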

Note: PowerShell (Core) 7+ now defaults to BOM-less UTF-8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so you could solve your problem with the built-in cmdlets there (see the sketch below).
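In PowerShell 7+, the cmdlet mechanics for a file that is already known to be Windows-1251-encoded could look like the following sketch (the file name sample-1251.nfo is a placeholder; the per-file "is it already UTF-8?" test would still have to be layered on top, e.g. as in the .NET-based solution below):

# PowerShell 7+ only: -Encoding accepts any [System.Text.Encoding] instance,
# and UTF-8 output is BOM-less by default.
$win1251 = [Text.Encoding]::GetEncoding(1251)

# Read a file known to be Windows-1251-encoded ...
$text = Get-Content -Raw -Encoding $win1251 .\sample-1251.nfo
# ... and rewrite it in place; the default output encoding is utf8NoBOM.
Set-Content -NoNewline .\sample-1251.nfo -Value $text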

You must therefore use the .NET framework directly:

Get-ChildItem *.nfo -Recurse | ForEach-Object {

  $file = $_.FullName

  $mustReWrite = $false
  # Try to read the file as strict UTF-8 first; an exception is thrown if
  # invalid-as-UTF-8 bytes are encountered. The decoded text is discarded
  # ($null = ...) - only success or failure of the decoding matters here.
  try {
    $null = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
  } catch [System.Text.DecoderFallbackException] {
    # Fall back to Windows-1251
    $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
    $mustReWrite = $true
  } 

  # Rewrite as UTF-8 without a BOM (the .NET framework's default)
  if ($mustReWrite) {
    Write-Verbose "Converting from 1251 to UTF-8: $file"
    [IO.File]::WriteAllText($file, $content)
  } else {
    Write-Verbose "Already UTF-8-encoded: $file"
  }

}

Note: As in your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
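For example, a lower-memory variant could validate and convert via stream readers and writers instead of ReadAllText / WriteAllText. The following is only a sketch of that idea, under the same assumption as above (anything that isn't valid UTF-8 is treated as Windows-1251); the .tmp suffix and the 64 KB buffer size are arbitrary choices:

Get-ChildItem *.nfo -Recurse | ForEach-Object {

  $file = $_.FullName
  $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)

  # Pass 1: stream-read in chunks with a strict UTF-8 decoder purely to test
  # validity; the decoded characters themselves are thrown away.
  $isUtf8 = $true
  $reader = [IO.StreamReader]::new($file, $strictUtf8)
  try {
    $buffer = [char[]]::new(64KB)
    while ($reader.Read($buffer, 0, $buffer.Length) -gt 0) { }
  } catch [System.Text.DecoderFallbackException] {
    $isUtf8 = $false
  } finally {
    $reader.Dispose()
  }

  if (-not $isUtf8) {
    # Pass 2: convert Windows-1251 -> BOM-less UTF-8 via a temporary file,
    # then replace the original.
    Write-Verbose "Converting from 1251 to UTF-8: $file"
    $tempFile = $file + '.tmp'
    $in  = [IO.StreamReader]::new($file, [Text.Encoding]::GetEncoding(1251))
    $out = [IO.StreamWriter]::new($tempFile, $false, [Text.UTF8Encoding]::new($false))
    try {
      $buffer = [char[]]::new(64KB)
      while (($read = $in.Read($buffer, 0, $buffer.Length)) -gt 0) {
        $out.Write($buffer, 0, $read)
      }
    } finally {
      $in.Dispose()
      $out.Dispose()
    }
    Move-Item -Force -LiteralPath $tempFile -Destination $file
  } else {
    Write-Verbose "Already UTF-8-encoded: $file"
  }

}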

Note:

  • If an input file contains only ASCII-range characters (7-bit bytes), it is by definition also UTF-8-encoded, because UTF-8 is a superset of the ASCII encoding.

  • It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules around what bit patterns are allowed where).
    Such a file would not contain meaningful Windows-1251 content, however.

  • There is no point in implementing a further fallback for the Windows-1251 decoding, because decoding with Windows-1251 can never fail: there are no technical restrictions on which byte values can occur where, so every byte sequence decodes to some string (a small demonstration follows this list).
    Generally, in the absence of external information (or a BOM), there is no simple and robust way to infer a file's encoding from its content alone (though heuristics can be employed).

mklement0
  • Thanks for your detailed answer! I tried to execute it from the PowerShell ISE, but got no output and the files are unchanged. Are there any other prerequisites? – plaidshirt Nov 14 '18 at 09:23
  • @plaidshirt: What happens if you run `$VerbosePreference = 'Continue'` before executing the code? What does the verbose output then indicate? Perhaps the files have all already been converted? Generally, avoid repeated invocations from the ISE, because they happen in the same scope (are all "dot-sourced"), with previous invocations possibly producing unwanted side effects for subsequent ones. – mklement0 Nov 15 '18 at 02:42
  • Files aren't converted. I get absolutely no output, just a security warning when the .ps1 file is executed. I select the "Run once" option and get the prompt back without any messages. – plaidshirt Nov 15 '18 at 12:15
  • @plaidshirt: If you've set `$VerbosePreference = 'Continue'` and the script produces no output, I can think of only two explanations: you're accidentally running a _different_ script, or the `Get-ChildItem` command matches no files and the `ForEach-Object` block is never entered. – mklement0 Nov 15 '18 at 12:58
  • @plaidshirt: P.S.: Another possibility: you may be accidentally quietly ignoring a terminating error occurring in the script. – mklement0 Nov 15 '18 at 16:09
  • @plaidshirt: Set `$ErrorActionPreference = 'Stop'` before running the script and make sure there's no similar statement inside the script that overrides it. Also, try running it in a regular console window rather than in the ISE. Make sure that the working directory is the correct one. Remember to also set `$VerbosePreference = 'Continue'`. – mklement0 Nov 16 '18 at 13:03