Looking for simple practical C++ examples of how to use ICU

Question

I am looking for simple, practical C++ examples of how to use ICU.
The ICU home page is not helpful in this regard.
I am not interested in the what and why of Unicode.
The few demos are neither self-contained nor compilable (where are the includes?).
I am looking for something like a 'Hello, World' for:
How to open and read a file encoded in UTF-8
How to use STL / Boost string functions to manipulate UTF-8 encoded strings, etc.

Did you see this question: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring — yasouser, May 15 '11 at 20:37

score 13 · Accepted Answer · edited Feb 27 '23 at 16:58


There's no special way to read a UTF-8 file unless you need to process a byte order mark (BOM). Because UTF-8 is a byte-oriented encoding in which ASCII characters keep their single-byte values, functions that read ANSI (narrow `char`) strings can also read UTF-8 strings.
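If the file might start with a BOM, one way to handle it is to drop the three bytes `EF BB BF` from the first line before passing it on. The following is only a minimal sketch using the standard library; `strip_utf8_bom` is a hypothetical helper name, not an ICU or standard function.

```cpp
#include <string>

// Hypothetical helper (not part of ICU or the standard library):
// removes a leading UTF-8 byte order mark (bytes EF BB BF) if present.
// Returns true when a BOM was found and removed.
bool strip_utf8_bom(std::string& line) {
    static const char bom[] = "\xEF\xBB\xBF";
    // compare() clamps the range to the string length, so lines
    // shorter than three bytes simply compare unequal.
    if (line.compare(0, 3, bom) == 0) {
        line.erase(0, 3);
        return true;
    }
    return false;
}
```

It would typically be called once, on the first line returned by `std::getline`, since a BOM can only appear at the very start of the file.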

The following code will read the contents of a file (ANSI or UTF-8) and do a couple of conversions.

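#include <fstream>
#include <string>

#include <unicode/unistr.h>

int main(int argc, char** argv) {
    std::ifstream f("...");
    std::string s;
    while (std::getline(f, s)) {
        // at this point s contains a line of text
        // which may be ANSI or UTF-8 encoded

        // convert std::string to ICU's UnicodeString
        // (names are qualified with icu:: because recent ICU releases
        // no longer import the icu namespace by default)
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s));

        // convert UnicodeString to std::wstring by copying UTF-16 code units;
        // this is only fully correct where wchar_t is 16 bits wide, because
        // characters outside the BMP remain split into surrogate pairs
        std::wstring ws;
        for (int i = 0; i < ucs.length(); ++i)
            ws += static_cast<wchar_t>(ucs[i]);
    }
}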
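
The conversion loop above copies UTF-16 code units one at a time, which leaves surrogate pairs uncombined on platforms where `wchar_t` is 32 bits wide. A possible portable variant is sketched below using only the standard library. `utf16_to_wstring` is a hypothetical helper, not an ICU function, and it assumes the UTF-16 data has first been copied out of the `UnicodeString`, for example with `std::u16string(ucs.getBuffer(), ucs.length())`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU function): converts UTF-16 code units
// to std::wstring for either 16-bit or 32-bit wchar_t.
// Unpaired surrogates are copied through unchanged.
std::wstring utf16_to_wstring(const std::u16string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        const bool lead = c >= 0xD800 && c <= 0xDBFF;
        // Only a wide wchar_t can hold a full code point; with a 16-bit
        // wchar_t the UTF-16 code units are already the right form.
        if (sizeof(wchar_t) > 2 && lead && i + 1 < in.size()) {
            const char32_t trail = in[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
                ++i;
            }
        }
        out += static_cast<wchar_t>(c);
    }
    return out;
}
```

With a 32-bit `wchar_t`, a character such as U+1F600 comes out as a single element; with a 16-bit `wchar_t` it stays as two surrogates, which is what a UTF-16 `std::wstring` expects.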

Take a look at the online API reference.

If you want to use ICU through Boost, see Boost.Locale.

edited Feb 27 '23 at 16:58 by Ted Lyngmo
answered May 15 '11 at 23:15 by Ferruccio
8

This code is wrong for any platform where wchar_t is not 16-bit, as ucs.getBuffer() always returns a pointer to UTF-16 data. – wjl Jun 07 '11 at 23:08
Is `std::getline` sufficient ? I'm assuming it wouldn't recognize a `U+2028` for instance ? – lmat - Reinstate Monica Mar 06 '14 at 21:54
@LimitedAtonement - `std::getline()` doesn't know anything about character encoding; it simply reads a string of bytes until it sees a `\n`. `UnicodeString::fromUTF8()` is responsible for recognizing that a series of bytes represents a Unicode code point and converting them accordingly. In this case, the UTF-8 representation of U+2028 is `E2 80 A8`. `std::getline()` will have no problem reading those bytes. – Ferruccio Mar 06 '14 at 22:52
@Ferruccio the comment "at this point s contains a line of text...UTF-8 encoded" is incorrect then? Any (multi-byte) character with a `0x0a` (`\n`) byte in it will be slaughtered, right ? (I guess I'll have to TIAS :) ) – lmat - Reinstate Monica Mar 07 '14 at 14:30
@Ferruccio It appears that I can stand to be corrected. `\n` in UTF-8 only occurs in the single-byte character `0x0a`. The problem exists if the input text is UTF-16, however, I think. – lmat - Reinstate Monica Mar 07 '14 at 14:38
@LimitedAtonement - you could always use `std::wifstream` to process UTF16 data, but that would require that you know its format before opening the file. – Ferruccio Mar 07 '14 at 15:41
**− 1** Show the `#include`s. – Cheers and hth. - Alf Jul 05 '17 at 01:48
**0** Removed downvote because the question is fixed. Thanks! – Cheers and hth. - Alf Jul 06 '17 at 17:57
ICU has a function u_strToWCS which can convert a UnicodeString to a std::wstring – Superfly Jon Feb 24 '23 at 14:20
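
The point made in the comments about `std::getline` can be checked with a small experiment. Every byte of a multi-byte UTF-8 sequence is `0x80` or higher, so splitting on `0x0A` never cuts a character in half, and U+2028 (`E2 80 A8`) stays inside the line because `std::getline` does not treat it as a line break. The sketch below uses only the standard library; `split_lines` is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 byte string into lines at '\n'
// (0x0A) using std::getline, without any knowledge of the encoding.
std::vector<std::string> split_lines(const std::string& utf8) {
    std::istringstream in(utf8);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

For an input such as `"a\xE2\x80\xA8" "b\n" "\xC3\xA9\n"`, this yields two lines: the first keeps all three bytes of U+2028 between `a` and `b`, and the second holds the two bytes of U+00E9.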

score 10 · Answer 2 · answered May 17 '11 at 18:19

ICU ≠ Boost, so you will find examples of how to use ICU functions to manipulate strings, but not Boost functions.
Which samples are you looking at? There are samples within the ICU source tree, under icu/source/samples. I think the converter samples there open and close UTF-8; see also icu/source/extras/uconv, which is an 'iconv'-like application.
There are more samples at http://source.icu-project.org/repos/icu/icuapps/trunk/

Hope this helps.
