
I understand that std::codecvt<char16_t, char, std::mbstate_t> in C++11 performs conversion between UTF-16 and UTF-8, and std::codecvt<char32_t, char, std::mbstate_t> performs conversion between UTF-32 and UTF-8. Is it possible to convert between, say, UTF-8 and ISO 8859-1?

Consider:

const char* s = "\u00C0";

If I print this string and my terminal's encoding is set to UTF-8, I will see the character À. If I set my terminal's encoding to ISO 8859-1, however, printing that string will not print out the desired character. How would I convert s into a string that, when printed, will show the character À if my terminal's encoding is set to ISO 8859-1?

I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++.

Brian Bi
  • There aren't any explicit non-Unicode encodings in the C++ standard. You have "the system's encoding" that you can convert to and from, and maybe instruct your system to be in ISO 8859-1 (maybe via an environment variable); or use an explicit conversion library such as `iconv`. – Kerrek SB Jul 03 '14 at 21:35
  • @KerrekSB, how would you convert to and from the "system encoding"? – Brian Bi Jul 03 '14 at 21:38
  • Check out the table near the bottom [of this documentation](http://en.cppreference.com/w/cpp/locale/codecvt). E.g. [`mbrtoc32`](http://en.cppreference.com/w/cpp/string/multibyte/mbrtoc32) converts from the system's narrow encoding to UTF-32. (You might wonder [where the `<cuchar>` header is](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented)...) – Kerrek SB Jul 03 '14 at 21:40

2 Answers


In addition to the standard-mandated encodings, C++ also supports an implementation-defined list of encodings via locales:

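#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// std::codecvt_byname has a protected destructor; deriving from it makes the
// facet destructible so that std::wstring_convert can own and delete it.
template <typename Facet>
struct usable_facet : Facet {
  using Facet::Facet;
};

using codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;

int main() {
  // Locale names are platform specific; ".1252" is a Windows code page name.
  std::wstring_convert<codecvt> convert(new codecvt(".1252"));

  // Assumes the narrow literal is encoded in that code page.
  std::wstring w = convert.from_bytes("\u00C0");
}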

Unfortunately, the standard only requires that wchar_t use a fixed-width encoding within each locale; it does not require the same encoding across different locales. You therefore cannot portably convert to wchar_t using one locale and then convert that back to char using a different locale.

There is potentially some portable support for such conversions through std::mbrtoc32 and related functions, but these are not yet widely implemented.
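
As a rough illustration of what such a conversion looks like where <cuchar> is available, the sketch below decodes a string in the current C locale's narrow multibyte encoding to UTF-32 with std::mbrtoc32. The helper name narrow_to_utf32 is hypothetical, and the result depends on whichever locale std::setlocale has selected; in the default "C" locale only basic ASCII input can be expected to decode.

```cpp
#include <cstddef>
#include <cuchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a narrow multibyte string (in the encoding of
// the current C locale) into UTF-32, one character at a time.
std::u32string narrow_to_utf32(const std::string& narrow) {
    std::u32string out;
    std::mbstate_t state{};  // zero-initialized = initial conversion state
    const char* p = narrow.data();
    const char* end = p + narrow.size();

    while (p < end) {
        char32_t c;
        std::size_t n = std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            // -1: invalid sequence, -2: input ends in the middle of a character
            throw std::runtime_error("invalid or truncated multibyte sequence");
        }
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3)) {
            continue;  // character came from earlier input; no bytes consumed
        }
        p += (n == 0) ? 1 : n;  // n == 0 means a null character was decoded
    }
    return out;
}
```

Going the other way, std::c32rtomb converts a UTF-32 code point back to the narrow encoding of the current locale.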

> I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++.

The locale library's design doesn't really lend itself to modern usage. C and C++ are themselves confused about encodings vs. character sets, and locales conflate lexical and orthographic issues with computational aspects such as encoding.

How locales work is a topic a bit broader than is suitable for a Stack Overflow answer, but there are books on the topic. You'd probably also need to read platform-specific materials, because the standard doesn't really give any context for much of the functionality. For example, the locale library supports message catalogues, but doesn't tell you what they are or how you'd actually make one, because that functionality is not standardized by C++.

bames53
  • Can you give an example that actually compiles? I get an error about the codecvt object having a protected destructor. – Brian Bi Jul 03 '14 at 22:28
  • @Brian I've updated the code to fix a couple typos. The usable_facet template addresses the protected destructor issue (though in Microsoft's implementation the destructor is accessible without this trick). Note that the `new codecvt` expression does not refer to `std::codecvt`. – bames53 Jul 03 '14 at 22:43

If you want to convert UTF-8 to ISO 8859-1 using only the facilities of the C++ standard library:

  1. Convert UTF-8 → UTF-32 (converting to UTF-16 would also work).
  2. Each code point below 256 maps directly to the ISO 8859-1 byte with the same value; code points of 256 and above have no ISO 8859-1 representation.
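
The two steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name utf8_to_latin1 is hypothetical, and it relies on std::wstring_convert and std::codecvt_utf8, which are standard in C++11 but deprecated since C++17. Rejecting code points of 256 and above with an exception is an assumed policy; substituting a replacement character such as '?' would be equally reasonable.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 bytes in, ISO 8859-1 bytes out.
std::string utf8_to_latin1(const std::string& utf8) {
    // Step 1: UTF-8 -> UTF-32. from_bytes throws std::range_error on
    // malformed input.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> to_utf32;
    std::u32string code_points = to_utf32.from_bytes(utf8);

    // Step 2: a code point below 256 is already the ISO 8859-1 byte value.
    std::string latin1;
    latin1.reserve(code_points.size());
    for (char32_t cp : code_points) {
        if (cp > 0xFF) {
            // Assumed policy: refuse anything ISO 8859-1 cannot represent.
            throw std::range_error("code point not representable in ISO 8859-1");
        }
        latin1.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return latin1;
}
```

Given the UTF-8 bytes 0xC3 0x80 for À, the function returns the single byte 0xC0, which a terminal set to ISO 8859-1 displays as À.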

Since this has an answer, while almost any other desired specific encoding would not have an answer, I suspect that the question was constructed in order to be answerable.

The standard library conversions support only one other encoding, namely the unspecified multibyte encoding of the execution character set, via e.g. mbstowcs. (To be formally pedantic, the wide character encoding need not be Unicode, so formally there is another unspecified encoding, but in practice it is Unicode, i.e. UTF-16 or UTF-32.)


I wondered if I should add a code example, but since there’s no interest in this answer (to the question’s “I am curious whether it can be done using only the C++ standard library”) I think it would be wasted effort.

Cheers and hth. - Alf