24

I am looking for simple, practical C++ examples of how to use ICU.
The ICU home page is not helpful in this regard.
I am not interested in the what and why of Unicode.
The few demos are not self-contained, compilable examples (where are the includes?).
I am looking for something like a 'Hello, World' for:
  • How to open and read a file encoded in UTF-8
  • How to use STL / Boost string functions to manipulate UTF-8 encoded strings, etc.

user754425

2 Answers

13

There's no special way to read a UTF-8 file unless you need to process a byte order mark (BOM). Because of the way UTF-8 encoding works, functions that read ANSI strings can also read UTF-8 strings.
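
As a minimal sketch of the BOM case mentioned above: the UTF-8 BOM is the three bytes EF BB BF at the very start of the file, so it can be dropped from the first line before any conversion. The helper name `strip_utf8_bom` is hypothetical; it is not part of ICU or the standard library.

```cpp
#include <string>

// Hypothetical helper (an assumption for illustration, not an ICU API):
// removes a leading UTF-8 byte order mark (EF BB BF) from s, if present.
// Returns true when a BOM was found and removed.
bool strip_utf8_bom(std::string& s) {
    static const char bom[] = "\xEF\xBB\xBF";
    // compare() clamps the substring to the string's size, so strings
    // shorter than three bytes simply compare unequal.
    if (s.compare(0, 3, bom) == 0) {
        s.erase(0, 3);
        return true;
    }
    return false;
}
```

It would typically be called once, on the first line returned by `std::getline`, before that line is handed to anything that interprets the text.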

The following code will read the contents of a file (ANSI or UTF-8) and do a couple of conversions.

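#include <fstream>
#include <string>

#include <unicode/unistr.h>   // icu::UnicodeString, icu::StringPiece
#include <unicode/ustring.h>  // u_strToWCS

int main(int argc, char** argv) {
    std::ifstream f("...");
    std::string s;
    while (std::getline(f, s)) {
        // at this point s contains a line of text
        // which may be ANSI or UTF-8 encoded

        // convert std::string to ICU's UnicodeString (UTF-16 internally);
        // bytes that are not valid UTF-8 are replaced with U+FFFD
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s));

        // convert UnicodeString to std::wstring; u_strToWCS copes with both
        // 16-bit and 32-bit wchar_t, unlike a per-code-unit cast
        UErrorCode status = U_ZERO_ERROR;
        int32_t len = 0;
        u_strToWCS(nullptr, 0, &len, ucs.getBuffer(), ucs.length(), &status);  // preflight: length only
        status = U_ZERO_ERROR;  // the preflight call reports a buffer overflow; reset it
        std::wstring ws(len, L'\0');
        u_strToWCS(&ws[0], len, nullptr, ucs.getBuffer(), ucs.length(), &status);
    }
}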

Take a look at the online API reference.

If you want to use ICU through Boost, see Boost.Locale.

Ted Lyngmo
Ferruccio
  • This code is wrong for any platform where wchar_t is not 16-bit, as ucs.getBuffer() always returns a pointer to UTF-16 data. – wjl Jun 07 '11 at 23:08
  • Is `std::getline` sufficient ? I'm assuming it wouldn't recognize a `U+2028` for instance ? – lmat - Reinstate Monica Mar 06 '14 at 21:54
  • @LimitedAtonement - `std::getline()` doesn't know anything about character encoding; it simply reads a string of bytes until it sees a `\n`. `UnicodeString::fromUTF8()` is responsible for recognizing that a series of bytes represents a Unicode code point and converting them accordingly. In this case, the UTF-8 representation of U+2028 is `E2 80 A8`. `std::getline()` will have no problem reading those bytes. – Ferruccio Mar 06 '14 at 22:52
  • @Ferruccio the comment "at this point s contains a line of text...UTF-8 encoded" is incorrect then? Any (multi-byte) character with a `0x0a` (`\n`) byte in it will be slaughtered, right ? (I guess I'll have to TIAS :) ) – lmat - Reinstate Monica Mar 07 '14 at 14:30
  • @Ferruccio It appears that I can stand to be corrected. `\n` in UTF-8 only occurs in the single-byte character `0x0a`. The problem exists if the input text is UTF-16, however, I think. – lmat - Reinstate Monica Mar 07 '14 at 14:38
  • @LimitedAtonement - you could always use `std::wifstream` to process UTF-16 data, but that would require that you know its format before opening the file. – Ferruccio Mar 07 '14 at 15:41
  • **− 1** Show the `#include`s. – Cheers and hth. - Alf Jul 05 '17 at 01:48
  • **0** Removed downvote because the question is fixed. Thanks! – Cheers and hth. - Alf Jul 06 '17 at 17:57
  • ICU has a function u_strToWCS which can convert a UnicodeString to a std::wstring – Superfly Jon Feb 24 '23 at 14:20
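
The point made in the comments above, that `std::getline` splits only on the byte `0x0A` and passes multi-byte UTF-8 sequences such as U+2028 (`E2 80 A8`) through untouched, can be checked with a small sketch. The helper name `split_lines` is hypothetical; it uses only the standard library and knows nothing about encodings.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration): splits raw bytes
// into lines exactly the way std::getline does, i.e. on the byte 0x0A only.
std::vector<std::string> split_lines(const std::string& bytes) {
    std::istringstream in(bytes);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        // No UTF-8 lead or continuation byte is ever 0x0A, so a
        // multi-byte sequence is never cut in half here.
        lines.push_back(line);
    }
    return lines;
}
```

Given the bytes `61 E2 80 A8 62 0A 63 0A`, this yields two lines: the first is five bytes long and still contains `E2 80 A8`, because U+2028 is not treated as a line break at the byte level. Whether U+2028 should count as a line break is a decision for later text processing.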
10
  • ICU ≠ Boost, so you will find examples of how to use ICU functions to manipulate strings, but not Boost functions.

  • Which samples are you looking at? There are samples within the ICU source tree, under icu/source/samples. I think the converter samples there open and close UTF-8; see also icu/source/extras/uconv, which is an 'iconv'-like application.

  • More samples are at http://source.icu-project.org/repos/icu/icuapps/trunk/

Hope this helps.

Steven R. Loomis