
I have a string with a currency symbol:

std::string currency = "€";

I have converted it to unsigned chars:

const unsigned char* buf = reinterpret_cast<const unsigned char*>(currency.data());

for (auto i = 0u; i < currency.length(); ++i)
{
    std::cout << std::hex << static_cast<int>(buf[i]) << std::endl;
}

and, according to this description, I get the UTF-8 representation: 0xE2 0x82 0xAC. I use gcc/Linux.

1. Is this behaviour cross-platform in C++?

I have a device which uses the Windows-1252 encoding, where the euro symbol is represented by 0x80.

2. How do I perform the conversion from UTF-8 to Windows-1252? Is it possible in a more generic/automatic way than:

unsigned char eurWindows1252;
if(currency == "€")
{
    eurWindows1252 = 0x80;
}
Irbis

  • Usually C++ will encode the string using whatever bytes were in the source. If the source is UTF-8 the string will be too, and if the source is Windows-1252 so will the string. The standard has defined new prefixes that can be used to be explicit about UTF-8, see e.g. [How are u8-literals supposed to work?](https://stackoverflow.com/q/23471935/5987) I don't think there's any way to specify another encoding so you'll have to do a conversion. – Mark Ransom Mar 28 '19 at 22:37
  • On Linux, you can use iconv to do character set conversions. – Shawn Mar 28 '19 at 22:42
  • Use a Unicode conversion library, like ICONV or ICU. Or use platform-specific functions, like `MultiByteToWideChar()` and `WideCharToMultiByte()` on Windows. – Remy Lebeau Mar 28 '19 at 23:01
  • @MarkRansom The compiler converts unprefixed literals to the specified or default execution character encoding. (The compiler knowing the source character encoding is a separate issue.) – Tom Blodget Mar 29 '19 at 11:05

1 Answer


To work correctly with Unicode you always need to know the encoding of your strings. The code below doesn't specify the encoding, so it is a bad starting point if you want portable code:

std::string currency = "€";

With C++11 the simplest solution is to use an encoding prefix; for UTF-8 that is:

std::string currency = u8"€";

Now your string is always encoded as UTF-8 on every platform, and by accessing the individual chars of the string you get the individual UTF-8 bytes.

If you don't have C++11, then you will probably use wide strings:

std::wstring currency = L"€";

And then use a Unicode-aware library (ICU, iconv, Qt, `MultiByteToWideChar()`, etc.) to convert your string to UTF-8.

Personally if you want to write cross platform code I would stick with C++11 and use internally for all your strings std::string and the UTF-8 encoding together with u8"...". It's so much easier.

Now about converting your UTF-8 string to Windows-1252. Certainly, if you only need to convert the € and a few other characters, you could do it yourself with a string compare. But if the set of needed characters (or the list of strings to convert) grows, then it's probably better to use one of the libraries already mentioned. The choice is strongly influenced by the platforms on which your code must run.

The Unicode world contains over 100,000 characters. There exist, for example, many variants of the "C" character. Do you want to ignore all of them (e.g. convert them to a question mark) and handle only the plain old "C" and "c"? Or do you also want to convert a "Ć" into a "C", so that your conversion offers more compatibility?

You may want to take a look at these questions: Portable and simple unicode string library for C/C++? and How well is Unicode supported in C++11?

andreaplanet