7

What open source C or C++ libraries can convert arbitrary UTF-32 to NFC?

Libraries that I think can do this so far: ICU, Qt, GLib (not sure?).

I don't need any other complex Unicode support; just conversion from arbitrary but known-correct UTF-32 to UTF-32 that is in NFC form.

I'm most interested in a library that can do this directly. For example, Qt and ICU (as far as I can tell) both do everything via an intermediate conversion stage to and from UTF-16.
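To make the intermediate stage concrete, here is a minimal sketch of what a UTF-32 to UTF-16 round trip involves, using only the standard library. The helper names `utf32_to_utf16` and `utf16_to_utf32` are hypothetical and are not taken from Qt or ICU. The sketch assumes the input is already valid, as stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode known-valid UTF-32 as UTF-16.
std::u16string utf32_to_utf16(const std::u32string &in) {
    std::u16string out;
    out.reserve(in.size());  // at least one code unit per code point
    for (char32_t cp : in) {
        // Precondition: a valid scalar value (in range, not a surrogate).
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: decode well-formed UTF-16 back to UTF-32.
std::u32string utf16_to_utf32(const std::u16string &in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // Combine a high surrogate with the low surrogate that follows.
            const char32_t low = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(cp);
    }
    return out;
}
```

Each direction allocates a second buffer and walks the whole string once, which is the overhead I would like to avoid paying on top of the normalization itself.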

wjl
  • What is NFC? Unicode Normalization Form Canonical Composition? – Billy ONeal Nov 24 '11 at 06:46
  • @BillyONeal: I'm pretty sure that is it. See http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms – wallyk Nov 24 '11 at 06:49
  • Why do you care about implementation details? I wouldn't care if a library used UTF-13 internally, as long as it produces the right results. – MSalters Nov 24 '11 at 10:37
  • "I don't need complex Unicode support" is a strange requirement. Surely, normalization *is* a very complex operation that requires full access to the Unicode character database... – Kerrek SB Nov 24 '11 at 15:02
  • @MSalters you are right that implementation details don't matter to a large extent. However, I'm using C++ because I care about memory usage and execution time: a single intermediate conversion could easily double both. If I didn't care *at all*, I'd just use Python and be done with it. =) – wjl Nov 24 '11 at 17:31
  • @Kerrek I didn't say it's a *requirement* that the library doesn't have complex Unicode support, I just don't *need* anything except UTF-32 to UTF-32 NFC conversion. For example, Qt is MUCH, MUCH simpler than ICU in its Unicode support, but both support normalization. – wjl Nov 24 '11 at 17:39
  • What is the output destined for that requires NFC, and why is an intermediate conversion undesirable? – rvalue Dec 01 '11 at 04:53

2 Answers

2

ICU or Boost.Locale (wrapping ICU) will be your best bet by a very, very long way. The normalisation mappings will be equivalent to those used by most other software, which I assume is the point of this conversion.

rvalue
  • There is only one possible (correct) NFC normalization mapping, so there isn't any compatibility worry, but I suppose that ICU is perhaps the least likely to ever be buggy. I was hoping for something a little lighter-weight that could just do normalization, but after lots of looking, I ended up deciding that ICU was the best choice as well, so I'm marking this as accepted. =) – wjl Dec 01 '11 at 05:21
  • To clarify, by compatibility I mean as always: 'both sides will likely have the same bugs' =) – rvalue Dec 01 '11 at 05:58
0

Here is the main part of the code I ended up using after deciding on ICU. I figured I should put it here in case it helps someone who tries this same thing.

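#include <cassert>
#include <string>
#include <unicode/normalizer2.h>  // icu::Normalizer2
#include <unicode/unistr.h>       // icu::UnicodeString

std::string normalize(const std::string &unnormalized_utf8) {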
    // FIXME: until ICU supports doing normalization over a UText
    // interface directly on our UTF-8, we'll use the insanely less
    // efficient approach of converting to UTF-16, normalizing, and
    // converting back to UTF-8.

    // Convert to UTF-16 string
    auto unnormalized_utf16 = icu::UnicodeString::fromUTF8(unnormalized_utf8);

    // Get a pointer to the global NFC normalizer
    UErrorCode icu_error = U_ZERO_ERROR;
    const auto *normalizer = icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, icu_error);
    assert(U_SUCCESS(icu_error));

    // Normalize our string
    icu::UnicodeString normalized_utf16;
    normalizer->normalize(unnormalized_utf16, normalized_utf16, icu_error);
    assert(U_SUCCESS(icu_error));

    // Convert back to UTF-8
    std::string normalized_utf8;
    normalized_utf16.toUTF8String(normalized_utf8);

    return normalized_utf8;
}
wjl