When I dump out Kafka messages into Windows cmd with the following command:

kafka-console-consumer  --bootstrap-server example.server.net:31128 --topic MY_TOPIC_NAME --from-beginning

the Cyrillic characters are shown in the console as gibberish.

Following this answer, I changed my code page to UTF-8 with chcp 65001. Then I started cmd with cmd /u and set the font to Lucida Console.

Now if I write some Cyrillic characters to a text file:

c:\test>echo привет > test.txt

the Cyrillic text is shown just fine. But when I dump the messages again, the Cyrillic characters are still garbled.

I was told that Kafka uses UTF-8 to store the data.
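
To illustrate what may be happening, here is a minimal Python sketch of UTF-8 bytes being displayed under a legacy code page. It assumes the console is still on an OEM code page such as 866, which is only a guess at this point; the helper name `as_console_would_show` is hypothetical.

```python
def as_console_would_show(text, console_codepage):
    """Encode text as UTF-8 (how Kafka is said to store it) and decode the
    bytes the way a console set to `console_codepage` would interpret them."""
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode(console_codepage)


if __name__ == "__main__":
    word = "привет"
    # Each Cyrillic letter takes 2 bytes in UTF-8, so 6 letters become 12 bytes.
    print(len(word), len(word.encode("utf-8")))
    # Assumption: a console on code page 866 maps every byte to one character.
    print(as_console_would_show(word, "cp866"))
    # With the UTF-8 code page (65001) the same bytes decode back to the text.
    print(as_console_would_show(word, "utf-8"))
```

If the garbled output resembles what the console shows, the active code page is probably not 65001 in the window where the consumer runs.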

Could you please advise how I should configure the Windows cmd console to solve this issue?

Dmitry Stepanov
  • Choose a font that supports Unicode. – Noodles May 31 '19 at 09:37
  • CMD is just a shell. You're likely using the system console, so display results depend on the Windows version and configuration, such as the selected TrueType font. Exactly what command are you running to "dump out" text? CMD's internal `type` command? more.com? These commands have to decode the input byte stream to Unicode since they use the console's wide-character API. We need to know the source encoding, and from there it depends on the command. `type` decodes using the console's output codepage. more.com decodes with the console's input codepage. chcp.com sets both. – Eryk Sun May 31 '19 at 09:40
  • I recommend reading [Using another language (code page) in a batch file made for others](https://stackoverflow.com/questions/48981387/) and [How to fix a batch file with an Hebrew font?](https://stackoverflow.com/questions/30478940/) and [Why are Danish characters not displayed as in text editor on executing batch file?](https://stackoverflow.com/questions/43046559/) They are all about same problem: missing knowledge about [character encoding](https://en.wikipedia.org/wiki/Character_encoding). – Mofi May 31 '19 at 09:55
  • @eryksun, I use external command to read Kafka messages. The source encoding is UTF-8 – Dmitry Stepanov May 31 '19 at 11:22
  • If the program naively writes bytes to the console, however they're encoded, via `WriteFile` or `WriteConsoleA`, then if those bytes happen to be UTF-8 encoded text, it should display properly if the console output codepage is set to 65001 and the selected font supports Cyrillic characters. (Depending on the program, in Windows 7 it might also write some garbage after each print, due to a bug in the Windows 7 console that causes it to misreport the number of bytes written.) However, if the program makes assumptions such as transcoding UTF-8 to ANSI or OEM, the result will be mojibake nonsense. – Eryk Sun May 31 '19 at 11:59
  • @DmitryStepanov Could you please show a screenshot with the issue? – montonero May 31 '19 at 12:39
  • @eryksun, you are right. The issue was that when I open a command prompt and check the code page, it reports 866, not 65001. It looks like I have to change it from 866 to 65001 every time I open a new console window. – Dmitry Stepanov May 31 '19 at 13:18
  • Set a "CodePage" dword value of 65001 in a subkey of "HKCU\Console" that's named for the initial title of the console window, but with backslash replaced by underscore and the Windows directory replaced by "%SystemRoot%". By default, if the title isn't set explicitly, Windows uses the executable path as the title. For example, for cmd.exe the subkey is "HKCU\Console\%SystemRoot%_system32_cmd.exe". This applies if we run CMD via the Win+R run dialog or `start cmd`, but *not* from a shortcut. (When entering the dword, make sure that decimal is selected in the dialog instead of hexadecimal.) – Eryk Sun May 31 '19 at 14:05
  • The codepage can be set in a shortcut, but I don't think the COM API to do so is documented, without which we'd have to modify the .LNK file directly. Maybe there's a library for that. Alternatively, in Windows 10 we can actually configure the ANSI and OEM system codepages as 65001 (UTF-8) in the administrative region settings in the control panel. The console defaults to OEM. (It would be nice if setting a "CodePage" default in "HKCU\Console" worked...) – Eryk Sun May 31 '19 at 14:12
  • @eryksun, yes, setting dword for `CodePage` in registry worked! Thanks a lot. – Dmitry Stepanov Jun 04 '19 at 07:05
  • UTF-8 input via `ReadFile` or `ReadConsoleA` is still limited to 7-bit ASCII in all versions of Windows. UTF-8 uses 2-4 bytes per non-ASCII character, but the console assumes 1 byte per character. Thus, in Windows 7, a read that contains a non-ASCII character 'succeeds' with 0 bytes read. In Windows 10, a read of N characters will read N bytes, but non-ASCII characters are translated to null bytes. Given this, you might want to only set the "CodePage" value for a specific window title such as "UTF-8 Console", used like `start "UTF-8 Console"`. – Eryk Sun Jun 04 '19 at 08:19
  • Note that this UTF-8 bug doesn't affect Unicode (UTF-16) console clients such as CMD, PowerShell and Python 3.6+, which read input via the wide-character functions `ReadConsoleW` or `ReadConsoleInputW`. It only affects console clients that use `ReadFile`, `ReadConsoleA`, or `ReadConsoleInputA`, which depend on the console input codepage, i.e. which depend on the console translating its UTF-16 input buffer to a legacy codepage or UTF-8. – Eryk Sun Jun 04 '19 at 08:24
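
The subkey naming rule described in the comments (backslashes replaced by underscores, the Windows directory replaced by "%SystemRoot%") can be sketched in Python as follows. This is only an illustration of the string transformation, not a registry tool: the function name `console_subkey`, the default `C:\Windows` location, and the case-insensitive prefix match are all assumptions.

```python
def console_subkey(exe_path, windows_dir=r"C:\Windows"):
    """Return the subkey name under HKCU/Console for a console window whose
    default title is the executable path (hypothetical helper)."""
    root = windows_dir.rstrip("\\")
    head, tail = exe_path[:len(root)], exe_path[len(root):]
    # Assumption: the Windows directory prefix is matched case-insensitively
    # and only when it is followed by a path separator.
    if head.lower() == root.lower() and tail.startswith("\\"):
        exe_path = "%SystemRoot%" + tail
    # Every remaining backslash becomes an underscore.
    return exe_path.replace("\\", "_")


if __name__ == "__main__":
    print(console_subkey(r"C:\Windows\system32\cmd.exe"))
```

For `C:\Windows\system32\cmd.exe` this yields `%SystemRoot%_system32_cmd.exe`, matching the example subkey given above. The "CodePage" dword value of 65001 would then be created under that subkey.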

0 Answers