
There is a C++ program to which I need to add the ability to read a file. I found that it isn't working for European special characters. The example I'm working with uses Swedish characters.

I changed the code to use wide characters, but that doesn't seem to have helped.

The sample text file that I'm reading has the following content:

"NEW-DATA"="Nysted Vi prøver lige igen"

This is on Windows, and Notepad says that this file is using UTF-8 encoding.

In Visual Studio, when debugging, the string that is read is displayed as if it were ASCII:

"NEW-DATA"="Nysted Vi prøver lige igen"

I changed the code to use the "wide" methods:

    std::wifstream infile;
    infile.open(argv[3], std::wifstream::in);
    if (infile.is_open())
    {
        std::wstring line;
        while (std::getline(infile, line))
        {

....

Is there something else I need to do to get it to correctly recognize UTF-8?

George Hernando
  • Possible duplicate of [Reading UTF-8 text and converting to UTF-16 using standard C++ wifstream](https://stackoverflow.com/questions/21636374/). – Remy Lebeau Jun 14 '18 at 18:43
  • Off-topic, but that’s not Swedish, it’s Danish. Swedish doesn’t use ”ø”. – molbdnilo Jun 14 '18 at 19:15
  • BTW—The file has a BOM (which you show as ""). A BOM is metadata that is not part of the text content of the file. So, it should not be allowed to end up in `line`. See this [BOM test and `putback` example](https://stackoverflow.com/a/8882051/2226988). (The BOM is likely part of the guessing that Notepad used to come up with UTF-8 as the encoding. If you don't like guessing, ask the sender which encoding was used. It might be a good idea to agree on UTF-8 so you can rely on it each time you open any version of any text file from the same source.) – Tom Blodget Jun 15 '18 at 23:47

2 Answers


You can read the UTF-8 content as narrow (char-based) text, but you will have to convert it to wide characters for Visual Studio to interpret it as Unicode.

Here's a stock function that we use for that:

#include <cstdio>     // for BUFSIZ
#include <windows.h>  // for MultiByteToWideChar, SysAllocString, BSTR

BSTR UTF8ToBSTR(char const* astr)
{
   // Note: the static buffer makes this function non-reentrant and limits
   // the converted string to BUFSIZ wide characters.
   static wchar_t wstr[BUFSIZ];

   // Look up the function description in MSDN.
   // Use of CP_UTF8 indicates that the input is a UTF-8 string.

   // Get the size of the output needed for the conversion
   // (in wide characters, including the terminating null).
   int size = MultiByteToWideChar(CP_UTF8, 0, astr, -1, NULL, 0);
   if (size <= 0 || size > BUFSIZ)
      return NULL;  // conversion failed or the result does not fit the buffer

   // Do the conversion and get the output.
   MultiByteToWideChar(CP_UTF8, 0, astr, -1, wstr, size);

   // Allocate memory for the BSTR and return the BSTR.
   return SysAllocString(wstr);
}

You'll have to add code to deallocate the memory allocated by the call to SysAllocString() when you are done with the string.

E.g.

BSTR bstr = UTF8ToBSTR(...);

// Use bstr
// ...


// Deallocate memory
SysFreeString(bstr);
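
As a minimal sketch, assuming the UTF8ToBSTR() helper above and the argv[3] path from the question (error handling elided), the read loop might look something like this:

#include <fstream>
#include <string>
#include <windows.h>

BSTR UTF8ToBSTR(char const* astr);  // the helper defined above

int main(int argc, char* argv[])
{
    // Read the raw UTF-8 bytes with a narrow (char-based) stream.
    std::ifstream infile(argv[3]);

    std::string line;
    while (std::getline(infile, line))
    {
        // Note: a UTF-8 BOM, if present, will still be at the start of
        // the first line and may need to be skipped before conversion.

        // Convert the UTF-8 line to a wide (UTF-16) BSTR.
        BSTR wide = UTF8ToBSTR(line.c_str());

        // ... use the wide string ...

        SysFreeString(wide);  // release the BSTR when done
    }
    return 0;
}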
R Sahu
  • Is there a portable method of handling UTF-8 on both Windows and Linux? I was hoping to just be able to use wchar_t. I think that BSTR is Microsoft-specific. – George Hernando Jun 14 '18 at 18:37
  • We're saying that this is just a Visual Studio limitation, right? Is it really worth all this change to the program just for the sake of the debugger window? The actual data in memory should be fine. – Lightness Races in Orbit Jun 14 '18 at 18:39
  • @GeorgeHernando: `BSTR` is effectively a type alias for a `wchar_t*` – Lightness Races in Orbit Jun 14 '18 at 18:39
  • @GeorgeHernando "*Is there a portable method of handling UTF-8 on both Windows and Linux?*" - look at [`std::wstring_convert`](https://en.cppreference.com/w/cpp/locale/wstring_convert). – Remy Lebeau Jun 14 '18 at 18:40
  • We use Qt at work. It provides the ability to construct `QString` from UTF-8 encoded C-style strings. I suspect there are other libraries that provide similar functions. You'll have to look them up. – R Sahu Jun 14 '18 at 18:41
  • @GeorgeHernando if all you want is cross-platform code for converting between UTF-8 and `wchar_t`, you can check out my code at https://stackoverflow.com/a/148766/5987. – Mark Ransom Jun 14 '18 at 18:50
  • this is very C-style. Wouldn't a wrapper class be more C++ like? – JHBonarius Jun 14 '18 at 18:57
  • @JHBonarius, of course it would. It's something that we have been using for a very long time :) – R Sahu Jun 14 '18 at 19:00

What is happening is that you have a UTF-8-encoded file but you are trying to read it as if it consisted of wide characters. That won't work. As you can see, the BOM has been read into your string verbatim, so obviously the mechanism you are using does not contain any logic that tries to parse the characters and decode UTF-8 multi-byte sequences.

Wide characters and UTF-8 are two fundamentally different things. There's no way that you're going to be able to read UTF-8 just by plopping in wchar_t (or std::wstring) and reading into it. You're going to need to use some kind of Unicode library. There's std::wstring_convert in C++11 (but that requires standard library support) and there's the manual mbstowcs()/wcstombs() route. It's all around better to use a library.

Source: https://www.reddit.com/r/cpp/comments/108o7g/reading_utf8_encoded_text_files_to_stdwstring/

I presume that mbstowcs()/wcstombs() are the portable alternatives to Microsoft's MultiByteToWideChar() and WideCharToMultiByte().
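
A minimal, portable sketch of the std::wstring_convert route (C++11; deprecated in C++17), assuming the file is read as narrow char text and each line is converted afterwards:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    // Read the raw UTF-8 bytes with a narrow stream.
    std::ifstream infile(argv[3]);

    // codecvt_utf8<wchar_t> maps UTF-8 to UCS-2 or UCS-4 depending on the
    // size of wchar_t; that covers the BMP text shown in the question.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    std::string line;
    while (std::getline(infile, line))
    {
        // from_bytes() throws std::range_error on invalid UTF-8.
        std::wstring wide = conv.from_bytes(line);

        // A UTF-8 BOM, if present, comes through as U+FEFF at the start
        // of the first line and may need to be skipped.

        // ... use wide ...
    }
    return 0;
}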

Mike Nakis
  • You can `imbue()` a UTF-8 locale into `std::wifstream` and then it can read UTF-8 files and convert them to whatever encoding the compiler uses for `wchar_t` (UTF-16 on Windows, UTF-32 on other platforms). – Remy Lebeau Jun 14 '18 at 18:44
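
A minimal sketch of the imbue() approach from the comment above, assuming C++11's std::codecvt_utf8 facet (deprecated in C++17 but available in the Visual Studio toolchains of that era); std::consume_header makes the stream swallow the BOM instead of passing it into the first line:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    std::wifstream infile;

    // Imbue a UTF-8 conversion facet before reading from the file.
    infile.imbue(std::locale(infile.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));

    infile.open(argv[3], std::wifstream::in);
    if (infile.is_open())
    {
        std::wstring line;
        while (std::getline(infile, line))
        {
            // line now holds the decoded wide-character text
        }
    }
    return 0;
}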