
As I understand it, different locales have different encodings. With ICU I'd like to convert from a UnicodeString to the current locale's encoding, and back. Specifically I'm using Boost's Filesystem library, which in turn uses either Windows' UTF-16, or Linux's UTF-8 encodings.

Is there a way to reliably do this using ICU, or another library?

Jookia
  • ICU is a pretty heavy library; it's probably overkill for this simple task. You may want to consider `libiconv` on Linux and `WideCharToMultiByte` and `MultiByteToWideChar` on Windows. Though you can use ICU too if you really want to. – n. m. could be an AI Sep 10 '11 at 14:35
  • Ah. I don't know, I just want Unicode support in my application. – Jookia Sep 10 '11 at 16:15
  • possible duplicate of [ICU UnicodeString to Locale Encoding](http://stackoverflow.com/questions/7370679/icu-unicodestring-to-locale-encoding) – tchrist Sep 10 '11 at 18:55

2 Answers


You can use ICU, but you may find `iconv()` sufficient. It is a lot simpler to set up and operate, it is part of POSIX, and it is easily available for Windows.

With either library, you have to convert your Unicode string to a wide string. In `iconv()` that target encoding is called `WCHAR_T`. Once you have a wide string, you can use it directly on Windows.
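As a rough sketch of that first step, the following shows one way to decode UTF-8 into a wide string through iconv's `WCHAR_T` target. The helper name `utf8_to_wide` is hypothetical, and the code assumes an iconv implementation that recognises the `WCHAR_T` encoding name and takes `char**` for the input buffer (as glibc and GNU libiconv do); treat it as an illustration rather than a finished implementation.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): decode UTF-8 bytes
// into a wide string using iconv's "WCHAR_T" target encoding.
std::wstring utf8_to_wide(const std::string& utf8) {
    // iconv_open takes (to, from), in that order.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // One wchar_t per input byte is always enough room for the output.
    std::vector<wchar_t> out(utf8.size() + 1);

    char* in_ptr = const_cast<char*>(utf8.data());
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(out.data());
    std::size_t out_left = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("iconv conversion failed");

    // out_left counts the unused bytes; the rest is what iconv wrote.
    std::size_t written = out.size() - out_left / sizeof(wchar_t);
    return std::wstring(out.data(), written);
}
```

Calling `utf8_to_wide("abc")` should yield `L"abc"`; invalid UTF-8 input is reported by throwing instead of being silently dropped.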

On Linux, you can then use `wcstombs()` to transform the wide string into the system's (and locale's) narrow multibyte encoding (don't forget `setlocale(LC_CTYPE, "");`). Alternatively, if you are sure that you want UTF-8 rather than the system's encoding, you can convert your original string to UTF-8 directly, again with either library.
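The `wcstombs()` route might look roughly like this. Both function names are hypothetical, and the sketch assumes the program calls `use_environment_locale()` once at startup; without that call the conversion runs in the default "C" locale, where only basic characters are guaranteed to convert.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: adopt the user's locale for character conversion.
// Call once at program startup, before any conversions.
void use_environment_locale() {
    std::setlocale(LC_CTYPE, "");
}

// Hypothetical helper: convert a wide string to the narrow multibyte
// encoding of the current LC_CTYPE locale.
std::string wide_to_locale(const std::wstring& wide) {
    // MB_CUR_MAX is the most bytes a single character can need in the
    // current locale; the extra byte leaves room for the terminator.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);

    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");

    // n excludes the terminating null byte.
    return std::string(buf.data(), n);
}
```

Note that `wcstombs()` stops at the first null wide character, so this sketch does not preserve embedded nulls.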

Maybe you'll find this post of mine to provide some background.

Kerrek SB
  • Am I going to have to end up making a string class for Unicode stuff? – Jookia Sep 10 '11 at 16:31
  • I'd just use `std::vector<uint32_t>` for the raw codepoint strings, or `std::vector<char32_t>` if you can. There's also a `std::u32string` (a typedef of `std::basic_string<char32_t>`), but since you cannot write to a string's data buffer, the vector is better. You can always say `std::u32string(v.begin(), v.end())` when you're done... – Kerrek SB Sep 10 '11 at 16:33
  • Would it be wise to just find a UTF-8 string library (I doubt my problem will go higher than the ASCII set, or the BMP on top of that), then add some iconv stuff to it? – Jookia Sep 10 '11 at 17:18
  • iconv is really not very heavy to use. I'd add a couple of clean interface functions that you use at the program boundaries, and then just stick to one string encoding internally -- whichever is most suitable for your needs. – Kerrek SB Sep 10 '11 at 18:47
  • Yes, but I don't know how I'd go about having a string class like `std::string`, but based on code points, when using a non-32-bit encoding. – Jookia Sep 10 '11 at 19:01
  • You can either use `std::vector<uint32_t>`, or `std::u32string`, or even `std::basic_string<uint32_t>`. – Kerrek SB Sep 10 '11 at 19:04
  • But isn't that a bit of a waste of space? – Jookia Sep 10 '11 at 19:15
  • Look -- how much text data do you have in your application? Will the difference between 1 and 4 megabytes really have such an impact, even if you have a million characters? Ultimately, just pick whichever fits best into your work flow. If you need to manipulate codepoints, I'd go with UTF-32. If you're just storing and echoing, UTF-8 in an `std::string` will do fine. If you like standard C, keep everything as a wide string, and you get Windows filesystem support for free. There are many things to consider, and you have many options, but space isn't a concern. – Kerrek SB Sep 10 '11 at 19:33
  • I guess that's the way to go then. – Jookia Sep 10 '11 at 19:50
  • By the way, [here is some sample code](http://stackoverflow.com/questions/7141260/compare-stdwstring-and-stdstring/7159944#7159944) to showcase how I'd encapsulate the transformations into nice C++ interfaces. You can rig up something similar using `iconv()` for the other conversions. – Kerrek SB Sep 10 '11 at 20:05
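The "clean interface functions at the program boundaries" suggested in the comments could be sketched along these lines, assuming UTF-32 in a `std::u32string` is the internal encoding and UTF-8 is wanted at the boundary. The name `to_utf8` is hypothetical, and the code assumes an iconv that accepts the `UTF-32LE`/`UTF-32BE` names and a `char**` input buffer (as glibc and GNU libiconv do).

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary function: code points in, UTF-8 bytes out.
std::string to_utf8(const std::u32string& text) {
    // Name the host byte order explicitly so no byte-order mark is involved.
    const char32_t probe = 1;
    const bool little_endian =
        *reinterpret_cast<const unsigned char*>(&probe) == 1;

    iconv_t cd = iconv_open("UTF-8", little_endian ? "UTF-32LE" : "UTF-32BE");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    // A code point needs at most 4 bytes in UTF-8.
    std::string out(text.size() * 4, '\0');

    char* in_ptr = const_cast<char*>(reinterpret_cast<const char*>(text.data()));
    std::size_t in_left = text.size() * sizeof(char32_t);
    char* out_ptr = &out[0];
    std::size_t out_left = out.size();

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("iconv conversion failed");

    // Trim the unused tail of the buffer.
    out.resize(out.size() - out_left);
    return out;
}
```

A matching `from_utf8` would swap the two encoding names and size the output buffer at one `char32_t` per input byte. Keeping both behind small functions like this means the rest of the program only ever sees one encoding.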

Use iconv: http://www.gnu.org/s/libiconv/documentation/libiconv/iconv.1.html It is pre-installed on most GNU systems.

Aarkan