I am looking for simple practical C++ examples on how to use ICU.
The ICU home page is not helpful in this regard.
I am not interested in the what and why of Unicode.
The few demos are not self-contained, compilable examples (where are the includes?).
I am looking for a 'Hello, World'-style example of:
- How to open and read a file encoded in UTF-8
- How to use STL / Boost string functions to manipulate UTF-8 encoded strings
- etc.

Did you see this question: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – yasouser May 15 '11 at 20:37
2 Answers
There's no special way to read a UTF-8 file unless you need to process a byte order mark (BOM). Because of the way UTF-8 encoding works, functions that read ANSI strings can also read UTF-8 strings.
The following code will read the contents of a file (ANSI or UTF-8) and do a couple of conversions.
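#include <fstream>
#include <string>
#include <unicode/unistr.h>

int main(int argc, char** argv) {
    std::ifstream f("...");
    std::string s;
    while (std::getline(f, s)) {
        // at this point s contains a line of text
        // which may be ANSI or UTF-8 encoded

        // convert std::string to ICU's UnicodeString (stored internally as UTF-16)
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s));

        // convert UnicodeString to std::wstring
        std::wstring ws;
        if (sizeof(wchar_t) == 2) {
            // 16-bit wchar_t (e.g. Windows): copy the UTF-16 code units as-is
            for (int32_t i = 0; i < ucs.length(); ++i)
                ws += static_cast<wchar_t>(ucs[i]);
        } else {
            // 32-bit wchar_t: copy whole code points so surrogate pairs are combined
            for (int32_t i = 0; i < ucs.length(); i = ucs.moveIndex32(i, 1))
                ws += static_cast<wchar_t>(ucs.char32At(i));
        }
    }
}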
Take a look at the online API reference.
If you want to use ICU through Boost, see Boost.Locale.
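The BOM case mentioned at the top of this answer can be handled without ICU. The following is a minimal sketch using only the standard library; `read_utf8_lines` is a hypothetical helper name, not an ICU or Boost function. It assumes the file is UTF-8 and simply drops a leading `EF BB BF` if one is present. Reading by line is safe because the byte `0x0a` never appears inside a multi-byte UTF-8 sequence.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU or the answer above): reads a UTF-8
// text file line by line and drops a leading BOM (EF BB BF) if present.
std::vector<std::string> read_utf8_lines(const std::string& path) {
    std::vector<std::string> lines;
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    std::string line;
    bool first = true;
    while (std::getline(in, line)) {
        if (first) {
            first = false;
            // The UTF-8 BOM is the three bytes EF BB BF at the start of the file.
            if (line.compare(0, 3, "\xEF\xBB\xBF") == 0)
                line.erase(0, 3);
        }
        // Binary mode leaves the '\r' of a Windows "\r\n" line ending in place.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Each returned string is still raw UTF-8 bytes, so it can be passed straight to `UnicodeString::fromUTF8()` as in the code above.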

This code is wrong for any platform where wchar_t is not 16-bit, as ucs.getBuffer() always returns a pointer to UTF-16 data. – wjl Jun 07 '11 at 23:08
Is `std::getline` sufficient? I'm assuming it wouldn't recognize a `U+2028`, for instance? – lmat - Reinstate Monica Mar 06 '14 at 21:54
@LimitedAtonement - `std::getline()` doesn't know anything about character encoding; it simply reads a string of bytes until it sees a `\n`. `UnicodeString::fromUTF8()` is responsible for recognizing that a series of bytes represents a Unicode code point and converting them accordingly. In this case, the UTF-8 representation of U+2028 is `E2 80 A8`. `std::getline()` will have no problem reading those bytes. – Ferruccio Mar 06 '14 at 22:52
@Ferruccio the comment "at this point s contains a line of text...UTF-8 encoded" is incorrect then? Any (multi-byte) character with a `0x0a` (`\n`) byte in it will be slaughtered, right? (I guess I'll have to TIAS :) ) – lmat - Reinstate Monica Mar 07 '14 at 14:30
@Ferruccio It appears that I can stand to be corrected. `\n` in UTF-8 only occurs in the single-byte character `0x0a`. The problem exists if the input text is UTF-16, however, I think. – lmat - Reinstate Monica Mar 07 '14 at 14:38
@LimitedAtonement - you could always use `std::wifstream` to process UTF-16 data, but that would require that you know its format before opening the file. – Ferruccio Mar 07 '14 at 15:41
Removed downvote because the question is fixed. Thanks! – Cheers and hth. - Alf Jul 06 '17 at 17:57
ICU has a function u_strToWCS which can convert a UnicodeString to a std::wstring – Superfly Jon Feb 24 '23 at 14:20
ICU ≠ Boost, so you will find examples of how to use ICU functions to manipulate strings, but not Boost functions.
Which samples are you looking at? There are samples within the ICU source tree, under icu/source/samples. I think the converter samples there open and close UTF-8; see also icu/source/extras/uconv, which is an 'iconv'-like application.
More samples are at http://source.icu-project.org/repos/icu/icuapps/trunk/
Hope this helps.