11

I'm trying to read UTF-8 encoded Polish characters from the console for my C++ application. I'm sure the console uses this code page (checked in its properties). What I have already tried:

  • Using cin - instead of "zażółć" I read "za\0\0\0\0"
  • Using wcin - instead of "zażółć" - same result as with cin
  • Using scanf - instead of 'zażółć\0' I read 'za\0\0\0\0\0'
  • Using wscanf - same result as with scanf
  • Using getchar to read characters one by one - same result as with scanf

At the beginning of the main function I have the following lines:

setlocale(LC_ALL, "PL_pl.UTF-8");
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);

I would be really grateful for help.

J. Łyskawa
  • 301
  • 1
  • 3
  • 10
  • Have you tried `ReadConsoleW`? – RbMm Jan 09 '18 at 20:51
  • I will be surprised if you can get this to work; Windows really doesn't use UTF-8, it much prefers UTF-16. – SoronelHaetir Jan 09 '18 at 20:52
  • Start by checking the return value of SetConsoleCP – tkhurana96 Jan 09 '18 at 20:55
  • SetConsoleCP returns true – J. Łyskawa Jan 09 '18 at 21:00
  • ReadConsoleW reads "za|óB\a" instead of "zażółć" – J. Łyskawa Jan 09 '18 at 21:10
  • ReadConsole should work but you can try ReadFile directly. – Mihayl Jan 09 '18 at 21:12
  • @J.Łyskawa - for me `ReadConsoleW` returns exactly `zażółć` (as UTF-16, of course; if you want UTF-8, convert it yourself) – RbMm Jan 09 '18 at 21:22
  • ReadConsole and ReadFile give same result as scanf – J. Łyskawa Jan 09 '18 at 21:24
  • "za|óB\a" - this what you may be view in memory debug window, etc, but really you read exactly "zazółc" – RbMm Jan 09 '18 at 21:25
  • 3
    Codepage 65001 (UTF-8) does not work for reading characters outside the 7-bit ASCII range, even with the new console in Windows 10. In older versions the entire call simply returns 0 characters read if even 1 non-ASCII character is in the string. In Windows 10 it substitutes NUL for non-ASCII characters. The bug is in the console host process, conhost.exe, which uses a scratch buffer for the `WideCharToMultiByte` call that is sized for single-byte or double-byte codepages (depending on the system locale), but not for a variable-width encoding such as UTF-8. – Eryk Sun Jan 09 '18 at 21:26
  • 1
    simply run `WCHAR cc[256]; ULONG n; ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE), cc, RTL_NUMBER_OF(cc), &n, 0); MessageBoxW(0,cc,0,0);` and you will see exactly what you read – RbMm Jan 09 '18 at 21:26
  • You must use Unicode (UTF-16) `ReadConsoleW` to read arbitrary non-ASCII characters. `ReadFile` and `ReadConsoleA` are limited to legacy single-byte and double-byte codepages. – Eryk Sun Jan 09 '18 at 21:28
  • You can also do `_setmode(_fileno(stdin), _O_U16TEXT); _setmode(_fileno(stdout), _O_U16TEXT);` at start. Now you can use `std::wcin` and `std::wcout` for UTF-16 input and output. This is more efficient than UTF-8 because conversion is not necessary. If you absolutely need UTF-8, use `std::codecvt_utf8_utf16` to convert. – zett42 Jan 09 '18 at 22:54

2 Answers

8

Here is the trick I use for UTF-8 support. The result is a multibyte string which can then be used elsewhere:

#include <cstdio>
#include <windows.h>
#define MAX_INPUT_LENGTH 255

int main()
{
    // Output codepage set to UTF-8 so the converted multibyte string prints
    // correctly; input is read as UTF-16 via ReadConsoleW.
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

    wchar_t wstr[MAX_INPUT_LENGTH];
    char mb_str[MAX_INPUT_LENGTH * 3 + 1];

    unsigned long read;
    HANDLE con = GetStdHandle(STD_INPUT_HANDLE);

    // Call the wide version explicitly so this works whether or not UNICODE
    // is defined.
    ReadConsoleW(con, wstr, MAX_INPUT_LENGTH, &read, NULL);

    // Convert UTF-16 to UTF-8; a BMP code point needs at most 3 UTF-8 bytes,
    // hence the "* 3" sizing of mb_str.
    int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str), NULL, NULL);
    mb_str[size] = 0;

    std::printf("ENTERED: %s\n", mb_str);

    return 0;
}

Should look like this:

[screenshot of the console output]

P.S. Big thanks to Remy Lebeau for pointing out some flaws!
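
Not part of the original answer, but a possible refinement based on the surrogate-pair caveat Eryk Sun raises in the comments below: read one character fewer than the buffer holds and, if the last UTF-16 unit is a lead surrogate, read one more so the pair stays intact when converting. A rough sketch, reusing the buffer sizes above:

#include <cstdio>
#include <windows.h>
#define MAX_INPUT_LENGTH 255

int main()
{
    SetConsoleOutputCP(CP_UTF8);

    wchar_t wstr[MAX_INPUT_LENGTH];
    char mb_str[MAX_INPUT_LENGTH * 3 + 1];

    HANDLE con = GetStdHandle(STD_INPUT_HANDLE);

    unsigned long read = 0;
    // Leave one element of headroom in case a surrogate pair gets split.
    ReadConsoleW(con, wstr, MAX_INPUT_LENGTH - 1, &read, NULL);

    // If the last unit is a lead (high) surrogate, 0xD800..0xDBFF, read the
    // trailing surrogate so WideCharToMultiByte sees the whole pair.
    if (read > 0 && wstr[read - 1] >= 0xD800 && wstr[read - 1] <= 0xDBFF)
    {
        unsigned long extra = 0;
        ReadConsoleW(con, wstr + read, 1, &extra, NULL);
        read += extra;
    }

    int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str), NULL, NULL);
    mb_str[size] = 0;

    std::printf("ENTERED: %s\n", mb_str);
    return 0;
}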

Killzone Kid
  • 6,171
  • 3
  • 17
  • 37
  • Thank you very much, it is exactly what I used after @RbMm's suggestion. – J. Łyskawa Jan 09 '18 at 21:43
  • The console can't render non-BMP surrogate pairs, but it does preserve them. So read `MAX_INPUT_LENGTH - 1` characters. If this splits a surrogate pair (i.e. `wstr[read - 1]` is a lead surrogate), then read one more character. – Eryk Sun Jan 09 '18 at 22:00
  • 1
    Also, if you are going to call `WideCharToMultiByte()` to get the required size of `mb_str`, you should dynamically allocate `mb_str` to that size, otherwise you risk a buffer overflow when `mb_str` is a static array (`* 2` is not adequate when going from UTF-16 to UTF-8, you need `* 3` or even `* 4` instead). If you do use a static array, you don't need to call `WideCharToMultiByte()` twice, just pass the max size of `mb_str` instead: `int size = WideCharToMultiByte(CP_UTF8, 0, wstr, read, mb_str, sizeof(mb_str), NULL, NULL); mb_str[size] = 0;` – Remy Lebeau Jan 09 '18 at 22:13
  • @RemyLebeau Thank you for pointing this out, I edited the code. – Killzone Kid Jan 09 '18 at 22:23
  • 1
    You can scale back to `MAX_INPUT_LENGTH * 3`. A BMP code is up to 3 bytes in UTF-8. Beyond that, UTF-16 uses 2 surrogate codes, which is 4 bytes as UTF-8 and over-allocated as 6 bytes in `mb_str`. – Eryk Sun Jan 09 '18 at 22:29
  • @eryksun: `MAX_INPUT_LENGTH` is being used as an *element* count, not a *byte* count. A BMP codepoint takes 1 element in UTF-16 and up to 3 elements in UTF-8. A non-BMP codepoint takes 2 elements in UTF-16, and 4 elements in UTF-8. So, you are right, `* 3` is sufficient for all possible conversions. I would still not use a static array, though. – Remy Lebeau Jan 09 '18 at 23:28
  • @RemyLebeau, I used "bytes" in reference to UTF-8. It seems innocuous to use bytes and elements interchangeably for a `char` array. Anyway, generally in this situation I'd expect the caller to supply the buffer (e.g. like `_read`), which is to be filled as much as possible. For this you'd have to loop, reading and UTF-8 encoding n - 1 elements in each pass, where n is the remaining buffer size divided by 3. n - 1 elements are read instead of n in case a second read is required to complete a surrogate pair, as mentioned above. – Eryk Sun Jan 10 '18 at 00:43
  • 1
    Remove `SetConsoleOutputCP(CP_UTF8)` and `SetConsoleCP(CP_UTF8)`. Setting the output codepage to UTF-8 is broken prior to Windows 8 (`WriteFile` and `WriteConsoleA` return the wrong number of bytes written for Non-ASCII characters), and setting the input codepage to UTF-8 is severely broken in all Windows versions (`ReadFile` and `ReadConsoleA` either replace non-ASCII characters with NUL or return 0 bytes read). Ensure `UNICODE` is defined so that `ReadConsole` is really `ReadConsoleW`, or call `ReadConsoleW` explicitly. – Eryk Sun Jan 11 '18 at 10:25
  • @eryksun AFAIK ReadConsole is defined as ReadConsoleW by default; if you want the ANSI version you need to use ReadConsoleA explicitly – Killzone Kid Jan 11 '18 at 10:59
  • @eryksun `>> Remove SetConsoleOutputCP(CP_UTF8) and SetConsoleCP(CP_UTF8)` I'd rather not. Removing `SetConsoleOutputCP(CP_UTF8)` prints gibberish. – Killzone Kid Jan 11 '18 at 12:01
  • 1
    You should be using `WriteConsoleW` to write UTF-16 to the console, or set the CRT mode to `_O_U16TEXT` and use the wide-character C API. `WriteFile` and `WriteConsoleA` are broken with UTF-8 in Windows 7. For example, using a buffered `FILE` with C `fwrite` will see the wrong number of bytes written and try to write the 'remaining' bytes in several writes that are all gibberish in UTF-8. And `ReadFile` and `ReadConsoleA` for the console are completely broken in UTF-8, even in Windows 10, so there is absolutely no point (zero gain -- all pain) in setting the console input codepage to UTF-8. – Eryk Sun Jan 11 '18 at 12:29
  • 1
    Hi there, just wanted to say that I love your test string. – Daniel Kamil Kozar Jan 13 '18 at 14:12
8

Although you’ve already accepted an answer, here’s a more portable version, which sticks closer to the standard library. Unfortunately, this is one area where I’ve found that a lot of widely-used implementations do not support things that are supposedly in the standard. For example, there is supposed to be a standard way to print multi-byte strings (which theoretically could be something unusual like shift-JIS, but in practice are UTF-8 on every modern OS), but it does not actually work portably. Microsoft’s runtime library is especially poor in this regard, but I’ve also found bugs in libc++.

/* Boilerplate feature-test macros: */
#if _WIN32 || _WIN64
#  define _WIN32_WINNT  0x0A00 // _WIN32_WINNT_WIN10
#  define NTDDI_VERSION 0x0A000002 // NTDDI_WIN10_RS1
#  include <sdkddkver.h>
#else
#  define _XOPEN_SOURCE     700
#  define _POSIX_C_SOURCE   200809L
#endif

#include <iostream>
#include <locale>
#include <locale.h>
#include <stdlib.h>
#include <string>

#ifndef MS_STDLIB_BUGS // Allow overriding the autodetection.
/* The Microsoft C and C++ runtime libraries that ship with Visual Studio, as
 * of 2017, have a bug such that neither stdio, iostreams, nor wide iostreams can
 * handle Unicode input or output.  Windows needs some non-standard magic to
 * work around that.  This includes programs compiled with MinGW and Clang
 * for the win32 and win64 targets.
 *
 * NOTE TO USERS OF TDM-GCC: This code is known to break on tdm-gcc 4.9.2. As
 * a workaround, "-D MS_STDLIB_BUGS=0" will at least get it to compile, but
 * Unicode output will still not work.
 */
#  if ( _MSC_VER || __MINGW32__ || __MSVCRT__ )
    /* This code is being compiled either on MS Visual C++, or MinGW, or
     * clang++ in compatibility mode for either, or is being linked to the
     * msvcrt (Microsoft Visual C RunTime) library.
     */
#    define MS_STDLIB_BUGS 1
#  else
#    define MS_STDLIB_BUGS 0
#  endif
#endif

#if MS_STDLIB_BUGS
#  include <io.h>
#  include <fcntl.h>
#endif

using std::endl;
using std::istream;
using std::wcin;
using std::wcout;

void init_locale(void)
// Does magic so that wcout can work.
{
#if MS_STDLIB_BUGS
  // Windows needs a little non-standard magic.
  constexpr char cp_utf16le[] = ".1200";
  setlocale( LC_ALL, cp_utf16le );
  _setmode( _fileno(stdout), _O_WTEXT );
  _setmode( _fileno(stdin), _O_WTEXT );
#else
  // The correct locale name may vary by OS, e.g., "en_US.utf8".
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  wcout.imbue(std::locale());
  wcin.imbue(std::locale());
#endif
}

int main(void)
{
  init_locale();

  static constexpr size_t bufsize = 1024;
  std::wstring input;
  input.reserve(bufsize);

  while ( wcin >> input )
    wcout << input << endl;

  return EXIT_SUCCESS;
}

This reads in wide-character input from the console regardless of its initial locale or code page.

If what you meant instead was that the input will be bytes in the UTF-8 encoding (such as from a redirected file in UTF-8 encoding), not console input, the standard way to accomplish this is supposed to be the conversion facet from UTF-8 to wchar_t in <codecvt> and <locale>, but in practice Windows doesn’t support Unicode locales, so you have to read the bytes in and then convert them manually. A more standard way to do that is mbstowcs(). I have some old code to do the conversion for STL iterators, but there are also conversion functions in the standard library. You might need to do this anyway, if for example you need to save or transmit in UTF-8.
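
As an illustration (not part of the original answer), here is a minimal sketch of that manual conversion using the std::codecvt_utf8_utf16 facet, which is deprecated since C++17 but still widely shipped. It assumes a platform such as Windows where wchar_t holds UTF-16, and you would still need the init_locale() setup above for the wide output to display correctly:

#include <codecvt>
#include <cstdlib>
#include <iostream>
#include <locale>
#include <string>

// Convert raw UTF-8 bytes to a wide string (UTF-16 where wchar_t is 16-bit).
static std::wstring from_utf8(const std::string &bytes)
{
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
  return conv.from_bytes(bytes); // throws std::range_error on invalid UTF-8
}

int main()
{
  std::string line;
  while (std::getline(std::cin, line))        // raw UTF-8 bytes, e.g. a redirected file
    std::wcout << from_utf8(line) << L'\n';
  return EXIT_SUCCESS;
}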

There are some who will recommend you store all strings in UTF-8 internally even when using an API like Windows’, which is based on some form of UTF-16, converting to another encoding only when you make API calls. I strongly advise you to use UTF-8 externally whenever you possibly can, but I don’t go quite that far. Note, however, that storing strings as UTF-8 will save you a lot of memory, especially on systems where wchar_t is UTF-32. You would have a better idea than I how many bytes this would typically save you for Polish text.
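
If you do keep strings in UTF-8 internally on Windows, the usual pattern (sketched here for illustration, not part of the original answer) is a small helper that converts to UTF-16 only at the API boundary, e.g. with MultiByteToWideChar:

#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to UTF-16 for Windows API calls.
static std::wstring widen(const std::string &utf8)
{
  if (utf8.empty()) return std::wstring();
  int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
  std::wstring out(len, L'\0');
  MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &out[0], len);
  return out;
}

int main()
{
  // "zażółć" spelled out as UTF-8 bytes so the source file's encoding doesn't matter.
  std::string msg = "za\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87";
  MessageBoxW(NULL, widen(msg).c_str(), L"UTF-8 internally, UTF-16 at the boundary", MB_OK);
  return 0;
}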

Davislor
  • 14,674
  • 2
  • 34
  • 49
  • Conversion between UTF-8 and UTF-16 is simple enough that I wrote my own: https://stackoverflow.com/a/148766/5987. Might be easier than relying on inconsistent implementations of the standard library. I also once made a program that kept strings internally as UTF-8 just for fun. The case for UTF-8 is made at http://utf8everywhere.org/. – Mark Ransom Jan 10 '18 at 04:21
  • Yeah, so have I. I’ll add the link. – Davislor Jan 10 '18 at 04:25
  • Unfortunately for "zażółć" it prints "zaz¢lc". – J. Łyskawa Jan 10 '18 at 22:56
  • 1
    @J.Łyskawa This bugfix works for me on VC 2017, with the console set to code page 437, 1251 or 65001 (UTF-8, presumably what you want). Does it work for you? – Davislor Jan 13 '18 at 14:00
  • As of 2022, Windows 10 and 11 in theory support code page 65001, a `".65001"` family of UTF-8 locales, and the `_O_U8TEXT` flag for `_setmode` to support UTF-8. In practice, they do not appear to work. – Davislor Jan 07 '22 at 03:11