console command & output to unicode

Question

I know, this is an old question, but none of the answers I found helps in the following scenario:

fc /u TextA.txt TextB.txt

compares the two Unicode encoded txt files and displays the result correctly (!) on the screen.

As expected,

fc /u TextA.txt TextB.txt > Comp.txt

does not result in a Unicode encoded file.

Unfortunately the method used in similar situations

cmd /u /c fc /u TextA.txt TextB.txt > Comp.txt

does not work; the generated file is ANSI encoded.

I hope somebody here can help ...

EDITED (after first comments): The problem seems to be that cmd /u (or chcp) works only with "internal" commands (like dir). fc is not an internal command ... (Thanks to LotPings!)

@Fabre: The first example makes no difference, and the second does not seem to be correct syntax. -- I don't know how to post files here... But you could simply type a word in Notepad and save the file as Unicode, do the same for the second file, and then check whether Comp.txt is Unicode or ANSI encoded. — newbieforever, Dec 13 '16 at 21:45
hey you're right! don't you forget who solves the issues here :) Very very good point. Deserves a little something. — Jean-François Fabre, Dec 13 '16 at 21:49
Well, I'm stuck, just like you. `cmd /u` just doesn't seem to work! — Jean-François Fabre, Dec 13 '16 at 22:01
@Jean-FrançoisFabre `cmd /?` states `/U Causes the output of **internal** commands to a pipe or file to be Unicode`. IMO fc.exe isn't internal. Maybe this helps: [how-to-make-unicode-charset-in-cmd-exe-by-default](http://stackoverflow.com/questions/14109024); even if the output is UTF-8, you could convert it to your flavor of UTF — , Dec 14 '16 at 01:53
Seems like a flaw in `fc`. Not sure there's anything you can do about it. — Mark Ransom, Dec 14 '16 at 03:33
@LotPings: Thank you very much, very useful info! Yes, fc is not an "internal" command, it is stored in "windows\command". Both `cmd /u ...` and e.g. `chcp 65001` work only with internal commands (like dir) and seem to have no effect at all on `fc`. -- Is there really no way to overcome this??? — newbieforever, Dec 14 '16 at 06:38
`fc` is not an internal command, but redirection like `>` is controlled by `cmd`, so I'd expect `cmd /U` to affect redirections; perhaps you need to change `cmd /u /c fc /u TextA.txt TextB.txt > Comp.txt` to `cmd /u /c fc /u TextA.txt TextB.txt ^> Comp.txt` or to `cmd /u /c "fc /u TextA.txt TextB.txt > Comp.txt"` in order to force the redirection to be handled by the `cmd /U` instance you are invoking rather than the parent instance... — aschipfl, Dec 14 '16 at 11:38
...just found out that this does not work either (I can't explain why); anyway, the following works: `cmd /U /C fc /U "TextA.txt" "TextB.txt" ^> "Comp.tmp" ^& type "Comp.tmp" ^> "Comp.txt" & del "Comp.tmp"` (note that the output file does not contain the hex. `FF FE` prefix; see [this thread](http://stackoverflow.com/q/19725309) about how to generate it, then you can append to it by changing the portion `^> "Comp.txt"` to `^>^> "Comp.txt"`) — aschipfl, Dec 14 '16 at 12:04
...here is a way to generate such a Unicode header: [batch: add a unicode header or how to add hex values or any other ways around this?](http://stackoverflow.com/a/41142676) — aschipfl, Dec 14 '16 at 12:30
@aschipfl: As already said, `cmd /u` has no effect on the redirection from `fc` (as an external command), so in this redirection already the special characters are not written correctly to the file (which cannot be repaired by converting this file to a unicode encoded file). You can test this by inserting eg the word **koča** (c with caron, easy to copy from a Google search result) in one of the two files: In comp.txt you will see **koca**! -- A strange issue, of course, but I think it is hopeless! — newbieforever, Dec 15 '16 at 06:52
`powershell -c ". fc.exe /u TextA.txt TextB.txt > Comp.txt"` note **fc.EXE** as `fc` is a powershell alias for `Format-Custom` cmdlet. Also note dot-sourced `. fc.exe` — JosefZ, Dec 15 '16 at 23:54
@JosefZ: The same as for aschipfl ... this generates a Unicode encoded file, but the special characters are not redirected/written correctly!!! — newbieforever, Dec 17 '16 at 11:39

JosefZ · Answer 1 · 2016-12-20T00:28:13.880

Short answer:

Use PowerShell's Compare-Object cmdlet as follows:

Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt")

With basic customization, writing the output to a file:

Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") |
  Format-Table -Property SideIndicator, InputObject -AutoSize -HideTableHeaders -Wrap |
    Out-File .\fileAB.txt -Encoding unicode

or

Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") -PassThru |
    Out-File .\fileAB.txt -Encoding unicode
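For readers without PowerShell, the same goal (a comparison whose output file is itself Unicode) can be sketched in Python. This is not Compare-Object: it swaps in the standard library's difflib, and the function name compare_utf16 is hypothetical. It assumes both inputs are UTF-16 text files, as Notepad's "Unicode" option produces.

```python
import difflib
from pathlib import Path


def compare_utf16(path_a, path_b, path_out):
    """Write a unified line diff of two UTF-16 text files to a UTF-16 file.

    Hypothetical helper for illustration; it uses difflib, not fc or Compare-Object.
    """
    # The "utf-16" codec honours a byte order mark on read and emits one on write.
    lines_a = Path(path_a).read_text(encoding="utf-16").splitlines()
    lines_b = Path(path_b).read_text(encoding="utf-16").splitlines()
    diff = difflib.unified_diff(
        lines_a, lines_b, str(path_a), str(path_b), lineterm=""
    )
    Path(path_out).write_text("\n".join(diff) + "\n", encoding="utf-16")
```

With the file names from the question, the call would be compare_utf16("TextA.txt", "TextB.txt", "Comp.txt"). Because the text never passes through a console code page, characters from any script are written unchanged.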

Original answer (see also amendment below):

The letter č (Latin Small Letter C With Caron, code point U+010D) appears in code pages 775/1257 (Baltic) and 852/1250 (Central Europe). I assume the latter, as koča looks like a common Slavic word for hut, cabin, or cottage.

Reproducing the problem: the next example shows a possible mojibake case between the OEM and ANSI code pages; apparently, cmd.exe itself performs some implicit (and unclear) character code transformations:

D:\test\Unicode> powershell -c "'fileA','fileB'|ForEach-Object {$_; Get-Content .\$_.txt}"
fileA
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
fileB
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň

D:\test\Unicode> chcp
Active code page: 1250

D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_1250.txt

D:\test\Unicode> type .\CompAB_1250.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc Řçźý§ě ˇ‚ Ô ś ĺ
a UC ·ć¬ü¦íµÖ Ň › Ő
***** .\FILEB.TXT
b lc Řçźý§ě ˇ‚ Ô ś ĺ
b UC ·ć¬ü¦íµÖ Ň › Ő
*****
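
The garbling above can be reproduced at the byte level. The following Python sketch is only an illustration and was not part of the original test. It assumes the redirected fc.exe output was written as OEM code page 852 bytes and then displayed under ANSI code page 1250, an assumption that is consistent with the transcript.

```python
# Illustration only: assumes the redirected output holds cp852 (OEM) bytes
# that are later interpreted as cp1250 (ANSI).
original = "ěščřžýáíé"

oem_bytes = original.encode("cp852")   # bytes as an OEM-852 writer would emit them
garbled = oem_bytes.decode("cp1250")   # the same bytes read as ANSI-1250

print(garbled)

# Nothing is lost at this stage: reversing the two steps restores the text.
restored = garbled.encode("cp1250").decode("cp852")
print(restored == original)
```

Since the mix-up is reversible, the file content is intact and only the code page used to read it is wrong, which is why switching to chcp 852 is enough to display it correctly.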

cmd fix:

D:\test\Unicode> chcp 852
Active code page: 852

D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_852.txt

D:\test\Unicode> type .\CompAB_852.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****

In the above example, both CompAB_1250.txt (garbled) and CompAB_852.txt (valid) are encoded in a single-byte code page. To get Unicode output, use PowerShell as follows:

PowerShell fix #1. Force PowerShell to use code page 852 from the command line (run chcp 852 explicitly before calling powershell):

D:\test\Unicode> chcp 852
Active code page: 852

D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"

D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****

PowerShell fix #2. Force PowerShell to use code page 852 on the fly, regardless of the active console code page and leaving the latter unchanged (for illustration, code page 1252 is used here, which lacks most of the letters in the test files):

D:\test\Unicode> chcp 1252
Active code page: 1252

D:\test\Unicode> powershell -c "[System.Console]::OutputEncoding=[System.Text.Encoding]::GetEncoding(852);. fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"

D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****

D:\test\Unicode> chcp
Active code page: 1252

For further explanation, run the following commands from a newly opened cmd window:

powershell -c "[console]::OutputEncoding"
chcp 1252
powershell -c "[console]::OutputEncoding"
chcp 1250
powershell -c "[console]::OutputEncoding"
chcp 852
powershell -c "[console]::OutputEncoding"
rem etc. etc. etc.

Edit (amendment): I finally tested with some Greek characters added to the input files. The fc.exe output looks fine when run directly from the command line (fc.exe /U .\fileA.txt .\fileB.txt), and even from PowerShell:

D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt"
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a    Ελληνικά  ΕΛΛΗΝΙΚΆ
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b    Ελληνικά  ΕΛΛΗΝΙΚΆ
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****

However, redirecting the above output to a file with >, or piping it into another cmdlet with |, loses information: some characters are either garbled (mojibake) or replaced by a ? question mark, for example:

PS D:\test\Unicode> . fc.exe /U .\fileA.txt .\fileB.txt | ForEach-Object {$_}
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a    ????????  ????????
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b    ????????  ????????
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
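
The question marks are what a lossy conversion to a single-byte code page produces. As a hedged illustration in Python (an assumption about the mechanism, not a trace of what PowerShell does internally), encoding the Greek sample to cp852 with replacement shows the same effect, while the Czech letters survive:

```python
# Illustration only: cp852 stands in for whatever single-byte code page
# the pipe or redirection converts through.
greek = "Ελληνικά"
czech = "ěščřžýáíé"

# Characters missing from the code page become "?" and cannot be recovered.
greek_bytes = greek.encode("cp852", errors="replace")
print(greek_bytes)

# Characters present in the code page round-trip unchanged.
czech_ok = czech.encode("cp852").decode("cp852") == czech
print(czech_ok)

# UTF-16 LE, which Out-File -Encoding unicode writes, can hold both scripts.
both = greek + czech
utf16_ok = both.encode("utf-16-le").decode("utf-16-le") == both
print(utf16_ok)
```

Unlike the OEM/ANSI mix-up, this loss is irreversible: once a character has been replaced by ?, converting the resulting file to Unicode afterwards cannot bring it back.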

Wow, wow, wow! Thank you very much for your deep analysis! However, if I understand your method(s) correctly, this is not a "unicode solution". I used `koča` only as an example. The files in 'real life' could contain characters not only from a `cp 852` but from the entire unicode. — newbieforever, Dec 19 '16 at 06:30
@newbieforever there's no _"unicode solution"_ in `cmd` as `cmd` is not fully unicode-aware and basically follows [CHAR_INFO structure](https://msdn.microsoft.com/en-us/library/windows/desktop/ms682013(v=vs.85).aspx). Answer updated. — JosefZ, Dec 20 '16 at 00:49
