Short answer:
Use PowerShell's Compare-Object cmdlet as follows:
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt")
With basic customization, writing the output to a file:
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") |
Format-Table -Property SideIndicator, InputObject -AutoSize -HideTableHeaders -Wrap |
Out-File .\fileAB.txt -Encoding unicode
or
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") -PassThru |
Out-File .\fileAB.txt -Encoding unicode
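To see what these commands return, here is a minimal self-contained sketch. The sample file names and contents are made up for illustration and are written to the temp folder; the sketch assumes the input files are UTF-16 LE with a BOM, as fc.exe /U expects. SideIndicator is <= for lines found only in the first input and => for lines found only in the second.

```powershell
# Hypothetical sample files in the temp folder; names and contents are illustrative only.
$fileA = Join-Path ([System.IO.Path]::GetTempPath()) 'sampleA.txt'
$fileB = Join-Path ([System.IO.Path]::GetTempPath()) 'sampleB.txt'

# Build the non-ASCII test word from a code point so the script
# does not depend on the encoding of its own source file.
$word = 'ko' + [char]0x010D + 'a'   # "koča"

# -Encoding Unicode writes UTF-16 LE with a BOM, which Get-Content detects on reading.
Set-Content -Path $fileA -Value 'common line', "a $word" -Encoding Unicode
Set-Content -Path $fileB -Value 'common line', "b $word" -Encoding Unicode

# By default only the differing lines are returned, each tagged with a SideIndicator.
$diff = Compare-Object (Get-Content $fileA) (Get-Content $fileB)
$diff | Format-Table -Property SideIndicator, InputObject -AutoSize
```

Because the comparison happens on .NET strings inside PowerShell, no console code page is involved until the result is displayed or written out.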
Original answer (see also amendment below):
The letter č (Latin Small Letter C With Caron, code point U+010D) appears in code pages 775/1257 (Baltic) and 852/1250 (Central Europe). I would assume the latter, since koča sounds like a common Slavic word for the English hut, cabin or cottage.
Reproducing the problem: the next example shows a possible mojibake case between the OEM and ANSI code pages; apparently, cmd.exe itself makes some implicit (and unclear) character code transformations:
D:\test\Unicode> powershell -c "'fileA','fileB'|ForEach-Object {$_; Get-Content .\$_.txt}"
fileA
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
fileB
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
D:\test\Unicode> chcp
Active code page: 1250
D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_1250.txt
D:\test\Unicode> type .\CompAB_1250.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc Řçźý§ě ˇ‚ Ô ś ĺ
a UC ·ć¬ü¦íµÖ Ň › Ő
***** .\FILEB.TXT
b lc Řçźý§ě ˇ‚ Ô ś ĺ
b UC ·ć¬ü¦íµÖ Ň › Ő
*****
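The garbled lines above are consistent with text encoded in OEM code page 852 being read back as ANSI code page 1250. The following is a minimal sketch of that effect, not a description of what cmd.exe does internally; it assumes a host where both code pages are available (for example Windows PowerShell 5.1).

```powershell
# Lowercase letters from the test files, built from code points (ě š č ř ž ý)
# so that the script does not depend on the encoding of its own source file.
$text = -join ([char[]](0x011B, 0x0161, 0x010D, 0x0159, 0x017E, 0x00FD))

$oem  = [System.Text.Encoding]::GetEncoding(852)    # OEM code page, Central Europe
$ansi = [System.Text.Encoding]::GetEncoding(1250)   # ANSI code page, Central Europe

# Encode as OEM 852, then misread the very same bytes as ANSI 1250.
$bytes   = $oem.GetBytes($text)
$garbled = $ansi.GetString($bytes)

$garbled                  # Řçźý§ě (matches the start of the garbled fc.exe line)
$oem.GetString($bytes)    # decoding with the matching code page restores the text
```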
cmd fix:
D:\test\Unicode> chcp 852
Active code page: 852
D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_852.txt
D:\test\Unicode> type .\CompAB_852.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
In the above example, both CompAB_1250.txt (garbled) and CompAB_852.txt (valid) are encoded in a one-byte code page. To get Unicode output, use PowerShell as follows:
PowerShell fix #1. Force PowerShell to use code page 852 from the command line (run chcp 852 explicitly before calling powershell):
D:\test\Unicode> chcp 852
Active code page: 852
D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"
D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
PowerShell fix #2. Force PowerShell to use code page 852 on the fly, regardless of the active console code page and leaving the latter unchanged (for illustration, code page 1252 is chosen, which lacks most of the letters used):
D:\test\Unicode> chcp 1252
Active code page: 1252
D:\test\Unicode> powershell -c "[System.Console]::OutputEncoding=[System.Text.Encoding]::GetEncoding(852);. fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"
D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
D:\test\Unicode> chcp
Active code page: 1252
For further explanation, run the following commands from a newly opened cmd window:
powershell -c "[console]::OutputEncoding"
chcp 1252
powershell -c "[console]::OutputEncoding"
chcp 1250
powershell -c "[console]::OutputEncoding"
chcp 852
powershell -c "[console]::OutputEncoding"
rem etc. etc. etc.
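Fix #2 above changes [Console]::OutputEncoding inside a short-lived powershell -c process. When working in an interactive PowerShell session instead, it may be safer to restore the previous value afterwards. A minimal sketch follows; Invoke-WithOutputEncoding is a hypothetical helper name, not a built-in cmdlet, and the sketch assumes the session is attached to a console.

```powershell
# Hypothetical helper: run a script block with a temporary console output encoding.
function Invoke-WithOutputEncoding {
    param(
        [Parameter(Mandatory)] [int] $CodePage,
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock
    )
    $previous = [Console]::OutputEncoding
    try {
        # PowerShell decodes the output of native programs with this encoding.
        [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding($CodePage)
        & $ScriptBlock
    }
    finally {
        # Restore the original encoding even if the script block throws.
        [Console]::OutputEncoding = $previous
    }
}

# Example call, assuming the same input files as in the fixes above:
# Invoke-WithOutputEncoding -CodePage 852 -ScriptBlock {
#     fc.exe /U .\fileA.txt .\fileB.txt | Out-File .\CompAB.txt -Encoding unicode
# }
```

Wrapping the change in try/finally keeps the rest of the session on its original encoding, which mirrors the "keeping the latter unchanged" goal of fix #2.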
Edit (amendment): finally tested with some Greek characters added to the input files; the fc.exe output looks fine from the command line (fc.exe /U .\fileA.txt .\fileB.txt) or even from PowerShell:
D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt"
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a Ελληνικά ΕΛΛΗΝΙΚΆ
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b Ελληνικά ΕΛΛΗΝΙΚΆ
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
However, redirecting the above output to a file with > or piping it into another cmdlet with | loses information: some characters are either garbled (mojibake) or at least replaced by a ? question mark, for example:
PS D:\test\Unicode> . fc.exe /U .\fileA.txt .\fileB.txt | ForEach-Object {$_}
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a ???????? ????????
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b ???????? ????????
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****