
I have a text file, input.xlf:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    &lt;source&gt;Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>

Basically, I need to replace &lt; with < and &gt; with >, so I run the script below:

runner.bat

powershell -Command "(gc input.xlf) -replace '&lt;', '<' | Out-File -encoding ASCII output.xlf";
powershell -Command "(gc output.xlf) -replace '&gt;', '>' | Out-File -encoding ASCII  output.xlf";

The above was working until I noticed the output below:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>??????</target>
    <note>Login Header</note>
  </trans-unit>

I tried removing the encoding, but now I get:

 <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
   <source>Login</source>
   <target>登入</target>
   <note>Login Header</note>  
 </trans-unit>

Below is my desired output

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>
Mofi
Owen Kelvin

1 Answer


There are (potentially) two character-encoding problems:

  • On output, using -Encoding Ascii is guaranteed to "lossily" transliterate any non-ASCII-range characters to literal ? characters.

    • To preserve all characters, you must choose a Unicode encoding, such as -Encoding Utf8.
  • On input, you must ensure that the input file is correctly read by PowerShell.

    • Specifically, Windows PowerShell misinterprets BOM-less UTF-8 files as ANSI-encoded, so you need to use -Encoding Utf8 with Get-Content too.
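
To make both problems concrete, here is a minimal in-memory sketch using the .NET encoding classes. It assumes ISO-8859-1 (code page 28591) as a stand-in for the system's ANSI code page, which varies by machine; the variable names are illustrative only.

```powershell
# Build the sample text from code points (U+767B, U+5165) so that the
# encoding of the script file itself does not affect the demonstration.
$text = "$([char]0x767B)$([char]0x5165)"

$utf8  = [System.Text.Encoding]::UTF8
$ascii = [System.Text.Encoding]::ASCII
# Assumption: ISO-8859-1 stands in for the machine's actual ANSI code page.
$ansiStandIn = [System.Text.Encoding]::GetEncoding(28591)

# The two characters occupy six bytes in UTF-8.
$bytes = $utf8.GetBytes($text)

# Input problem: decoding UTF-8 bytes with a single-byte encoding
# yields one (wrong) character per byte.
$misread = $ansiStandIn.GetString($bytes)

# Output problem: ASCII replaces every non-ASCII character with '?'.
$asciiFromMisread = $ascii.GetString($ascii.GetBytes($misread))
$asciiFromCorrect = $ascii.GetString($ascii.GetBytes($text))

"Bytes in UTF-8:             $($bytes.Length)"
"ASCII after misreading:     $asciiFromMisread"
"ASCII after correct decode: $asciiFromCorrect"
```

The six question marks from the misread text match the output shown in the question: the six UTF-8 bytes were first decoded as six separate characters, and each of those was then replaced by ? on ASCII output.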

You can also get away with a single powershell.exe call, and that call can be optimized further:

powershell -Command "(gc -Raw -Encoding utf8 input.xlf) -replace '&lt;', '<' -replace '&gt;', '>' | Set-Content -NoNewLine -Encoding Utf8 output.xlf"
  • Using -Raw with gc (Get-Content) reads the file as a whole instead of into an array of lines, which speeds up the -replace operations.

  • You can chain -replace operations.

  • With input that is already text (strings), Set-Content is generally the faster choice.[1]
    -NoNewLine prevents an extra trailing newline from getting appended.
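
As a rough way to check the points above, the same pipeline can be written as a script and run against a small sample file. This is only a sketch: the temp-directory paths and the sample content are hypothetical, and the sample is written as BOM-less UTF-8 on the assumption that this matches the real input.

```powershell
# Hypothetical sample file in the temp directory.
$dir = [System.IO.Path]::GetTempPath()
$in  = Join-Path $dir 'input.xlf'
$out = Join-Path $dir 'output.xlf'

# Two sample lines; the non-ASCII characters are built from code points.
$target = "$([char]0x767B)$([char]0x5165)"
$sample = "    &lt;source&gt;Login</source>`n    <target>$target</target>`n"

# Write the sample as UTF-8 without a BOM.
[System.IO.File]::WriteAllText($in, $sample, (New-Object System.Text.UTF8Encoding $false))

# Read the whole file as one string, chain both replacements,
# and write the result back without appending an extra newline.
(Get-Content -Raw -Encoding Utf8 $in) -replace '&lt;', '<' -replace '&gt;', '>' |
  Set-Content -NoNewline -Encoding Utf8 $out

# Read the result back for inspection.
$result = Get-Content -Raw -Encoding Utf8 $out
$result
```

Because -NoNewline is used, the output ends exactly where the input did, and the non-ASCII characters in the target element survive the round trip.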


[1] It will make virtually no difference here, given that only a single string is written, but with many input strings (line-by-line output) it may - see this answer.

mklement0
  • This is very helpful, given that these are translation files and they are quite big. – Owen Kelvin Nov 02 '21 at 16:27
  • Glad to hear it, @OwenKelvin. Yes, `-Raw` makes a big difference. The only caveat is that the file's content must fit into memory as a whole (actually _three_ times here, given that each `-replace` operation creates a copy), but even large text files are quite likely to fit. (Your own approach without `-Raw` _also_ loads the entire file, due to enclosing the `Get-Content` call in `(...)`, though that could be transformed into a _streaming_ approach where one line at a time is processed and saved to the target file, without needing to store the entire file's content in memory at once.) – mklement0 Nov 02 '21 at 16:35
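
The streaming approach mentioned in the comment above might look like the following sketch. The temp-directory paths and sample lines are hypothetical; note that the output file must differ from the input file, since the input is still being read while the output is written.

```powershell
# Hypothetical sample file in the temp directory (BOM-less UTF-8).
$dir = [System.IO.Path]::GetTempPath()
$in  = Join-Path $dir 'stream-input.xlf'
$out = Join-Path $dir 'stream-output.xlf'

$target = "$([char]0x767B)$([char]0x5165)"
$lines  = '    &lt;source&gt;Login</source>', "    <target>$target</target>"
[System.IO.File]::WriteAllLines($in, [string[]] $lines, (New-Object System.Text.UTF8Encoding $false))

# No (...) around Get-Content and no -Raw: lines flow through the
# pipeline one at a time instead of being collected in memory first.
Get-Content -Encoding Utf8 $in |
  ForEach-Object { $_ -replace '&lt;', '<' -replace '&gt;', '>' } |
  Set-Content -Encoding Utf8 $out

# Read the result back as an array of lines for inspection.
$streamed = @(Get-Content -Encoding Utf8 $out)
$streamed
```

This trades speed for lower memory use, and Set-Content terminates every line, including the last, with a newline.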