
I am making a console-based flashcard program. At startup, it reads a file containing UTF-8 encoded Japanese text (such as "ひらがな, カタカナ, 患者"). However, when I read console input (for example with std::getline()), the input comes out as "". How can I read UTF-8 input from the console? Maybe by opening STD_INPUT_HANDLE as a file? I call SetConsoleOutputCP() and SetConsoleCP() with CP_UTF8 as the argument to enable UTF-8 printing.

Minimal Reproducible Example, as requested by @πάντα ῥεῖ

#include <iostream>
#include <Windows.h>
#include <fstream>
#include <vector>
#include <string>

void populate(std::vector<std::string>& in) {
    std::ifstream file("words.txt"); // fill this with some UTF-8 characters, then check the contents of [in]

    std::string line;
    while (std::getline(file, line)) {
        in.emplace_back(line);
    }
}

int main() {
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8);

    SetConsoleTitleA("Example");

    std::vector<std::string> arr;
    populate(arr);

    std::string input_utf8; // type some UTF-8 characters when asked for input
    std::cin >> input_utf8;

    for (std::string s : arr)
        if (input_utf8 == s)
            std::cout << "It works! The input wasn't null!";
}
Remy Lebeau
racc2
  • UTF-8 support in Windows [UCR](https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) and [ACP](https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis). (Links courtesy of [phuclv](https://stackoverflow.com/a/63454192/4641116).) – Eljay Nov 03 '21 at 19:13
  • What is the length of the string returned by `std::getline`? You seem to be using the character code `10` (ASCII for `\n`) as the delimiting character in your call to `std::getline`. Does the byte value `10` maybe appear as part of a UTF-8 character sequence, so that `std::getline` misinterprets that part of the character sequence as an end of line? What are the exact byte values of the file contents? – Andreas Wenzel Nov 03 '21 at 19:15
  • @AndreasWenzel I believe you are correct. The string length is equal to 4, the characters I entered. How would I go about fixing this issue? I'm not used to UTF-8 stuff yet, so any tips would be appreciated. – racc2 Nov 03 '21 at 19:21
  • Sorry, I'm not sure what you mean "byte values". If you mean the actual bytes or hex of the file, it is [here](https://pastebin.com/raw/XUXUyqjp). Otherwise, the actual file is [here](https://pastebin.com/raw/h6q0DPn1). The code 10 does not appear in the pattern. – racc2 Nov 03 '21 at 19:30
  • @racc2 Post a [mcve] as required here please! [Edit] your question, if you hope to get any useful answers about what's wrong with your code. Links or images generally don't count!! – πάντα ῥεῖ Nov 03 '21 at 19:34
  • @racc2: In your posted input file, the 13th byte has the value `10` (`0x0A` in hexadecimal), which is the ASCII character code for a newline character. I believe that every Japanese character requires 3 bytes in UTF-8 encoding. That means that the first 12 bytes represent Japanese characters, which corresponds to 4 Japanese characters. This coincides with your report that the string returned by `std::getline` has a length of 4 characters. Therefore, the problem does not seem to be a matter of input, but rather a matter of output. – Andreas Wenzel Nov 03 '21 at 19:38
  • @AndreasWenzel I don't believe so because I cannot compare the strings either. I can not write it to a file, I cannot manipulate the string. It is not equal "" or to "\n". – racc2 Nov 03 '21 at 19:46
  • Thank you for pointing out the 13th byte! The original string "カタカナ", when read from the file, appears consistently as "カタカナ" in the VS Debugger, though the input appears empty. This makes me think it is an input issue. – racc2 Nov 03 '21 at 19:52
  • @racc2: The debugger is probably using the wrong code page to display the string, as it has no way of knowing that the string is encoded in UTF-8. Therefore, I suggest that you look at the numeric values of the individual bytes instead, and use a [third-party UTF-8 encoder](https://mothereff.in/utf-8) to interpret these numbers. – Andreas Wenzel Nov 03 '21 at 23:30
  • FWIW, I created a text file with your J words in it (ひらがな, カタカナ, 患者) and just printed the file in the console window: `type words.txt`. It printed nonsense. However, if I cut and paste that nonsense into my web browser here, it looks like real J text again. So this does not seem to be a Visual Studio problem, but a Command Prompt display problem. It was OK showing non-ASCII European text: èéøÞǽлљΣæča. I have tried various code pages including 65001 (`chcp 65001`) but to no avail. – Topological Sort Nov 05 '21 at 13:45
  • This page suggests changing your locale on your machine. When OP said that was not suitable for him, the expert said she didn't think there was any other method. https://social.technet.microsoft.com/Forums/windowsserver/en-US/728d6169-07fa-4e93-8a41-f9fa6dd8d9d9/displaying-japanese-in-english-command-prompt – Topological Sort Nov 05 '21 at 13:49
  • https://www.curlybrace.com/words/2014/06/05/unicode-and-the-windows-console/ seems to confirm this, as do other links. – Topological Sort Nov 05 '21 at 13:50
  • I'm running your [mcve] redirecting input from words.txt, so that it should find that its first line カタカナ does indeed match カタカナ. I had to be sure there was no space at the end -- std::getline will make the space part of the result, but std::cin >> won't. So I think this is solved. Let me know if you agree. – Topological Sort Nov 05 '21 at 16:04
  • @TopologicalSort that is the issue. The file reads fine, but console input does not. – racc2 Nov 08 '21 at 19:42
  • As of 2018, at least, Microsoft says they don't support Unicode input yet in the console window: https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/. I just installed Windows Terminal Preview https://www.microsoft.com/en-us/p/windows-terminal-preview/9n0dx20hk701?activetab=pivot:overviewtab, and it's at least willing to let me paste in カタカナ as input to the program, but IDK how to get Visual Studio to use it, nor is it working yet. I think we're going to find the tech just isn't up to it. – Topological Sort Nov 09 '21 at 21:20
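Several comments above suggest looking at the raw byte values rather than trusting how the debugger or console renders the string. A minimal sketch of that check follows, assuming the string holds valid UTF-8; the helper names `to_hex` and `count_code_points` are hypothetical and not from the question. Note that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so the byte value 10 (`0x0A`) can only ever be a real newline.

```cpp
#include <cstddef>
#include <string>

// Render each byte of s as two uppercase hex digits, separated by spaces.
std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Count code points by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Assumes s is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For a line holding カタカナ, `line.size()` is 12 while `count_code_points(line)` is 4, which matches the three-bytes-per-character observation in the comments. Printing `to_hex(line)` for both the file line and the console input shows directly whether the console delivered the same bytes.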

2 Answers

This program works for me. I needed code page 932 (Shift-JIS) to get things to show up right. (I do not have Japanese enabled on my Windows 10 machine, so it doesn't depend on that.) If I just use std::cin or std::wcin, I can see in the debugger that I am not getting the right input. But if I use ReadConsoleW/WriteConsoleW, everything looks correct.

#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <iostream>

using namespace std;

int main()
{
                                        //This code-page-changing stuff, plus the restoring later, is from
                                        //https://www.codeproject.com/articles/34068/unicode-output-to-the-windows-console
    UINT oldcp = GetConsoleOutputCP();  //what is the current code page? store for later
    SetConsoleOutputCP(932);            //set it up so it can do Japanese

    cout << "Enter something: "; 

    wchar_t wmsg[32];
    DWORD used;
    if (!ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE), //explicit W version: wmsg is a wchar_t buffer
        wmsg,
        31, //maximum number of wchar_t to read; wmsg has 32 slots
        &used,
        nullptr))
        cerr << "ReadConsole failed, le = " << GetLastError() << endl;

    size_t len = used;
    cout << "You entered: ";
    //From https://cboard.cprogramming.com/windows-programming/112382-printing-unicode-console.html
    if (!WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), 
            wmsg, (DWORD) len,
            &used, 0))
            cerr << "WriteConsole failed, le = " << GetLastError() << endl;
    cout << '\n';

    cout << "Hit enter to end (and restore previous code page)."; cin.get();
    SetConsoleOutputCP(oldcp);          //only the output code page was saved and changed above
    return 0;
}
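The text that ReadConsoleW delivers is UTF-16, while the lines the question reads from words.txt are UTF-8 bytes, so the two cannot be compared with `==` until one side is converted. The following is a minimal, portable sketch of that conversion, written by hand so it does not depend on any platform API. The names `utf16_to_utf8` and `trim_line_ending` are hypothetical helpers. On Windows the same conversion is usually done with `WideCharToMultiByte` and `CP_UTF8`, and the `wchar_t` buffer would first be copied into a `std::u16string`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Remove the trailing "\r\n" that line-based console input normally carries.
std::u16string trim_line_ending(std::u16string s) {
    while (!s.empty() && (s.back() == u'\r' || s.back() == u'\n'))
        s.pop_back();
    return s;
}

// Convert UTF-16 code units to UTF-8 bytes.
// Sketch only: unpaired surrogates are not rejected, just encoded as-is.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With these helpers, something like `utf16_to_utf8(trim_line_ending(std::u16string(wmsg, wmsg + used)))` (taken right after the read, before `used` is overwritten) would give a `std::string` that can be compared against the UTF-8 lines loaded from the file.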
Topological Sort

I had the same problem and wrote a very small library called libpu8 (https://github.com/jofeu/libpu8) to solve it. With this library, the program from the question needs only a few changes:

#include <iostream>
#include <Windows.h>
#include <fstream>
#include <vector>
#include <string>
#include <libpu8.h>

void populate(std::vector<std::string>& in) {
    // would also work for utf-8 filenames
    std::ifstream file(u8widen("words.txt"));

    std::string line;
    while (std::getline(file, line)) {
        in.emplace_back(line);
    }
}

int main_utf8(int argc, char** argv) // argv is now utf-8
{
    // would also work for utf-8 titles
    SetConsoleTitleW(u8widen("Example").c_str());

    std::vector<std::string> arr;
    populate(arr);

    std::string input_utf8;
    // works if cin is attached to a console, and also if 
    // cin is attached to a utf-8 file or pipe
    std::cin >> input_utf8;

    // works if cout is attached to a console, and also if 
    // cout is attached to a utf-8 file or pipe
    for (std::string s : arr)
        if (input_utf8 == s)
            std::cout << "It works! The input wasn't null!";
    return 0;
}

Also see my answer to a similar question: https://stackoverflow.com/a/64871504/11458816

umbert