Choosing encoding for icu::UnicodeString

Question

I needed a way to convert a string to lower case that is safe for both ASCII and UTF-16LE (as found in some Windows registry strings), and came across this question: How to convert std::string to lower case?

The answer that seemed "most correct" to me (I'm not using Boost) was one that demonstrated using the ICU library.

In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?

ISO-8859-1 has worked for the few unit tests I've run against ASCII-encoded strings that used only Latin characters, but I don't like using it without knowing why.

If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry can store values in UTF-16LE, and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe calls.

If the UTF-16 string contains text that doesn't fit in ASCII... what ASCII character do you want stored in the result of this transformation? — Nicol Bolas, Dec 29 '15 at 15:16
Bytes are just bytes, you simply have to know their encoding. To some extent you can make educated guesses, but those still remain guesses. See for example the "this app may fail" disaster on some Windows builtin editor some years ago. — Ulrich Eckhardt, Dec 29 '15 at 15:17
@NicolBolas: I clarified the question a bit as to my reasoning for the case conversion. I'm not converting between UTF-16LE and ASCII. I just need to be able to strlower() a string so I can compare in a case-insensitive way without caring whether the string is ASCII or UTF-16LE (in my code the two strings being compared will always match in encoding, so I'm never comparing ASCII to UTF-16LE). — Matthew, Dec 29 '15 at 17:12
@Matthew `"ISO-8859-1"` was just an example. You specify the encoding used for your string. If your strings have the same encoding, you still need to know which encoding to do case folding. If you have a `WCHAR*` from the Windows API, I think you can use the `UnicodeString(const UChar *text)` constructor. — roeland, Dec 29 '15 at 23:02

score 1 · Accepted Answer · answered Dec 30 '15 at 05:09

> I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison

UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.

> In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?

Because the author is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString represents a UTF-16 encoded string. If you construct it from a char*, you have to specify the charset the input data is actually encoded in, so that UnicodeString can decode it to Unicode and then store it as UTF-16.
