
I'm writing a portable library that deals with files and directories. I want to use UTF-8 for my input (directory paths) and output (file paths). The problem is, Windows gives me a choice between UTF-16-that-used-to-be-UCS-2 and codepages. So I have to convert all my UTF-8 strings to UTF-16, pass them to WinAPI, and convert the results back to UTF-8. C++11 seems to provide the <locale> library just for that, except that, from what I understand, none of the predefined specializations uses UTF-8 as the internal (i.e. my-side) encoding - the closest there is converts between an internal UTF-16 and an external UTF-8, which is the exact opposite of what I want. So here's the first question:

1) How do I use the codecvt thingamajigs to convert my UTF-8 strings to UTF-16 for WinAPI calls, and the UTF-16 results back to UTF-8?
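
For concreteness, this is roughly the shape of what I'm after - a sketch assuming C++11's `<codecvt>` header and its `std::codecvt_utf8_utf16` facet (the helper names are mine):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: UTF-8 <-> UTF-16 via std::wstring_convert (C++11).
// Helper names are hypothetical, not a real library API.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8); // throws std::range_error on bad input
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

On Windows I'd presumably instantiate this with wchar_t instead of char16_t (wchar_t is 16 bits there), so the results could go straight into the W-suffixed WinAPI calls.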

Another problem: I'm also targeting Linux. On Linux, there is very good support for many different locales - and I don't want my library to behave any differently. Hopefully everyone will use UTF-8 on their Linux machines, but there is no strict guarantee of that. So I thought it would be a good idea to extend the above Windows-specific behavior and always convert between UTF-8 and the system locale's encoding. Except that I don't see any way in C++11's <locale> library to get the current system encoding! The default std::locale constructor returns whatever locale I've set globally myself, and if I haven't set one, it returns the classic "C" locale. And there are no other getters I'm aware of. So here's the second question:

2) How do I detect the current system locale? Is there something in <locale>? Maybe some standard C library function, or (less portable, but okay in this case) something in the POSIX API?
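
For reference, the closest lead I have so far is the empty locale name, which is supposed to mean the user's preferred environment locale - a sketch (I'm not sure how uniform the returned names are across platforms):

```cpp
#include <clocale>
#include <iostream>
#include <locale>

int main()
{
    // C way: adopt the environment's locale and see what it's called.
    const char* name = std::setlocale(LC_ALL, ""); // e.g. "en_US.UTF-8"
    std::cout << (name ? name : "(failed)") << '\n';

    // C++ way: the empty string names the user-preferred locale.
    std::cout << std::locale("").name() << '\n';
}
```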

Xirdus
  • To whoever edited this question before my rollback: the second paragraph is **NOT** part of the first question!!! – Xirdus Jul 13 '14 at 12:54
  • possible duplicate of [Convert between string, u16string & u32string](http://stackoverflow.com/questions/7232710/convert-between-string-u16string-u32string) – tclamb Jul 15 '14 at 01:55
  • @tclamb Not exactly a duplicate, but answers to that question will be helpful to me. Thanks for the link. But question 2) still stands. – Xirdus Jul 15 '14 at 15:22
  • The C++ way would be to include `<locale>` and construct a locale object with an empty string as the name: `std::locale("").name()`. The C way would be to call `std::setlocale(LC_ALL, "")` from `<clocale>`. – tclamb Jul 18 '14 at 16:37
  • @tclamb, yeah, I found that out yesterday, but forgot to write it. Right now I'm busy with some other stuff, but when I get to implement this, I'll write down all my conclusions in an answer. That is, unless someone else wants to answer my question (in full). – Xirdus Jul 18 '14 at 17:46
  • It’s worth noting that [`locale`s are broken on OS X with GCC/libstdc++](http://stackoverflow.com/a/11192040/1968). If you want to be platform independent you need to use something else. – Konrad Rudolph Aug 08 '14 at 18:46

1 Answer


The design of these facilities in the standard library assumes that multibyte character encodings (like UTF-8) are used only for external storage (i.e. byte sequences in files on disk) and that all characters in memory are uniform in size. This is so things like std::basic_string<T>::operator[] can behave in a manner consistent with the performance constraints imposed by the standard. So while you can use files encoded in UTF-8 or some other MBCS (like those for Japanese), your strings in memory should be char, char16_t, char32_t or wchar_t.
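
To make the fixed-width assumption concrete, here is a small illustration of my own of what happens when UTF-8 bytes end up in a std::string:

```cpp
#include <iostream>
#include <string>

int main()
{
    // "zażółć": 6 characters, but 10 bytes when encoded as UTF-8.
    std::string s = u8"za\u017c\u00f3\u0142\u0107";

    std::cout << s.size() << '\n'; // prints 10 -- size() counts bytes
    char c = s[2]; // half of the two-byte sequence for 'ż', not a character
    (void)c;       // operator[] is constant-time byte access, nothing more
}
```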

This is why you aren't finding a match in the standard library for what you want to do: strings in memory aren't intended to be stored in UTF-8. This is similar to other languages as well, such as Java, where data on disk is interpreted as a stream of bytes, and to turn it into strings you need to tell some component the expected character encoding of the byte stream. Some operating systems may stuff a UTF-8 string into argv[], but this is non-standard. This is the reason that the Unicode-enabled entry point on Windows (wWinMain rather than WinMain, or wmain for console programs) provides a NUL-terminated pointer to wchar_t and not a char* pointing to a UTF-8 encoded string.
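
For example, a Windows program that wants UTF-8 internally typically converts right at that boundary. A sketch of mine using the WinAPI's WideCharToMultiByte (not something the standard library provides):

```cpp
#include <windows.h>
#include <string>
#include <vector>

// Convert a NUL-terminated UTF-16 string to UTF-8 via WinAPI.
static std::string utf16_to_utf8(const wchar_t* ws)
{
    int bytes = WideCharToMultiByte(CP_UTF8, 0, ws, -1,
                                    nullptr, 0, nullptr, nullptr);
    if (bytes <= 1)          // conversion error, or empty input
        return std::string();
    std::string out(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws, -1, &out[0], bytes, nullptr, nullptr);
    out.resize(bytes - 1);   // drop the terminating NUL we asked for with -1
    return out;
}

// The wide entry point receives UTF-16 arguments; convert once here and
// the rest of the program deals only in UTF-8.
int wmain(int argc, wchar_t* argv[])
{
    std::vector<std::string> args;
    for (int i = 0; i < argc; ++i)
        args.push_back(utf16_to_utf8(argv[i]));
    // ... run the actual logic on `args` ...
    return 0;
}
```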

IBM's International Components for Unicode (ICU) library provides a whole set of components that are complementary to, and designed to work with, the C++ standard library. I would look at its code conversion facilities. While the standard defines facilities in <locale> for code conversion, it doesn't guarantee the existence of a code conversion facility mapping from UTF-8 to char16_t, char32_t, or wchar_t. Whether such a thing exists depends on the details of your implementation. The ICU library provides this functionality portably for any C++ implementation. It is well supported, widely used, and unlikely to have bugs decoding UTF-8 strings into the appropriate wider-than-char string.
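
As a taste of its interface, a round trip through ICU's UTF-16-based icu::UnicodeString might look like this (a sketch; assumes ICU headers and libraries are available):

```cpp
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    // UnicodeString stores UTF-16 code units internally.
    std::string utf8 = u8"za\u017c\u00f3\u0142\u0107";
    icu::UnicodeString utf16 = icu::UnicodeString::fromUTF8(utf8);

    std::string back;
    utf16.toUTF8String(back); // re-encode as UTF-8 into `back`
    std::cout << std::boolalpha << (back == utf8) << '\n'; // true
}
```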

Konrad mentioned the UTF-8 Everywhere manifesto in a comment. It was an interesting read, and it points you to the Boost.Nowide library (not officially part of Boost yet) for solutions to the problems you cite above.
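
To give a flavor of that approach: Nowide keeps UTF-8 on your side of every call and does the UTF-16 hop internally. A sketch of mine, assuming its documented widen() helper (remove_file_utf8 is a made-up name):

```cpp
#include <boost/nowide/convert.hpp>
#include <windows.h>
#include <string>

// Accept UTF-8 from the portable code; bridge to the UTF-16 WinAPI here.
bool remove_file_utf8(const std::string& utf8_path)
{
    std::wstring wide = boost::nowide::widen(utf8_path); // UTF-8 -> UTF-16
    return ::DeleteFileW(wide.c_str()) != FALSE;
}
```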

Please note that my answer is simply a description of how the existing C++ standard library classes like std::basic_string<T> work. It is not advice against UTF-8, Unicode, or anything else. The manifesto cited agrees with me that these classes simply don't work this way, and that if you want to use UTF-8 everywhere, you need something else.

legalize
  • “strings in memory aren't intended to be stored in UTF-8.” — No. You are wrong here. The [UTF-8 Everywhere manifesto](http://www.utf8everywhere.org/) disagrees with you, and that document is seen as a pretty good summary by many developers. In summary, `std::string` is an okay container for UTF-8, and the standard library doesn’t offer sufficient facilities for working with Unicode anyway, regardless of underlying character type. – Konrad Rudolph Aug 08 '14 at 18:40
  • `std::string` is not designed for multi-byte character sets, it's just that simple. You can disagree with that design and create an alternative string class that is MBCS-aware, but `std::string` simply doesn't work that way. Everything in the standard library assumes that all characters are encoded in the same number of bits. When I say "strings in memory aren't intended to be stored in UTF-8", I am specifically referring to the typedefs for `std::string`, `std::wstring`, and the underlying template class `std::basic_string`. "Substring methods will happily return an invalid string" – legalize Aug 08 '14 at 22:43
  • The standard library is simply deficient in dealing with text, it’s no use arguing from its design. `std::string` is a byte storage, not a text storage. But it’s entirely permissible to use `std::string` as a transparent storage for UTF-8 encoded text, as long as you don’t operate on it in an encoding-agnostic way. For that you’ll need a library (such as ICU, or [Ogonek](http://flamingdangerzone.com/ogonek/), which has an infinitely nicer C++ interface but is still incomplete). – Konrad Rudolph Aug 09 '14 at 11:17
  • I feel we are talking about two different things but using the same words. The standard library is perfectly fine for text assuming an equal number of bits for every character. That assumption doesn't hold for UTF-8. It may be deficient for UTF-8, but that doesn't make it deficient in dealing with text. It simply isn't designed for any MBCS character encoding. If you're going to mean "UTF-8 strings" every time you say string and "UTF-8 text" every time you say text, it would help if you would clarify your terminology to avoid confusion. – legalize Aug 09 '14 at 20:00
  • No, I meant text in general, not one particular encoding. The article I’ve linked lists some of the deficiencies C++ has (try capitalising “ß”, for instance). But this is emphatically not my point. If you need to actually work with Unicode (not UTF-8) text, you quickly need a third-party library. But if you just need to accept text and pass it on to something else unchanged, then using `std::string` to store the raw bytes is entirely fine, and this is particularly true for UTF-8, which should be used almost exclusively for this purpose (that’s what the manifesto is arguing for). – Konrad Rudolph Aug 10 '14 at 00:05