17

I am wondering how to normalize strings (encoded as UTF-8 or UTF-16) in C/C++. In .NET there is the String.Normalize method.

I have used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt provide string normalization, but I would prefer a lightweight solution.

Is there any "lightweight" solution for this?

Ghassen Hamrouni

5 Answers

11

As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization.

Zoë Peterson
Avi
  • I have problems with utf8proc in Visual Studio 2010. typedef unsigned char bool; -> doesn't compile in C++ – Ghassen Hamrouni Feb 03 '11 at 11:01
  • I don't have familiarity with VS 2010, but can't you compile the library as a C library, and link it in that way? – Avi Feb 03 '11 at 11:24
  • The problem is in the header file, which is why we can't use it even as a static library. A simple workaround is to replace every occurrence of bool, true, and false with _bool, _true, and _false. Example: typedef unsigned char _bool; enum {_false, _true}; – Ghassen Hamrouni Feb 03 '11 at 12:47
  • Yes, you could probably do that without too much trouble - it isn't a very complicated library. We also had to make one or two minor changes like that. – Avi Feb 03 '11 at 14:09
  • 2
    The Julia team has an updated fork of utf8proc called libmojibake (https://github.com/JuliaLang/libmojibake) which is updated for Unicode 7 support. (It also has some other small fixes, e.g. it fixed C++ compatibility.) – Steven G. Johnson Sep 15 '14 at 23:58
4

For Windows, there is the NormalizeString() function (unfortunately available only on Vista and later, as far as I can see on MSDN):

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx

It's the simplest way to go that I have found so far. I guess it's quite lightweight too.

int NormalizeString(
    _In_      NORM_FORM NormForm,
    _In_      LPCWSTR   lpSrcString,
    _In_      int       cwSrcLength,
    _Out_opt_ LPWSTR    lpDstString,
    _In_      int       cwDstLength
);
NoOne
2

A good UTF-8 solution is glib's g_utf8_normalize() function. If you need this for std::wstring too, you would first have to convert std::wstring to std::string (UTF-16 to UTF-8), which makes it quite an expensive solution. I am therefore looking for a better solution myself, if possible using only pure C++(11).
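The conversion step mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the input is held in a std::u16string rather than a std::wstring (wchar_t is 16 bits wide on Windows but usually 32 bits elsewhere), the function name utf16_to_utf8 is made up for illustration, and unpaired surrogates are replaced with U+FFFD instead of being reported as errors.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of glib): converts UTF-16 code units to UTF-8.
// Unpaired surrogates are replaced with U+FFFD (REPLACEMENT CHARACTER).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }

        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can then be handed to glib, for example g_utf8_normalize(s.c_str(), -1, G_NORMALIZE_NFC); the string glib returns is newly allocated and has to be released with g_free().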

Mike Lischke
2

You could build ICU with minimal data (or possibly no other data; I think all of the normalization data is now internal) and then link it statically. I haven't tried this recently, but I believe the total size is pretty small in that case.

Steven R. Loomis
1

"Lightweight" in your context means "with limited functionality". I would use the ICU source as an example and refer to http://unicode.org/reports/tr15/ to implement this "lightweight" functionality yourself.
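To give an idea of what a hand-rolled, limited implementation involves, here is a minimal sketch of the one part of canonical decomposition that needs no data tables: precomposed Hangul syllables, which are decomposed arithmetically. The constants are the ones given in the Unicode Standard's Hangul algorithm; the function names are made up for illustration. This is not a full normalizer: everything else in NFD/NFC needs the decomposition mappings and combining classes from the Unicode Character Database, plus canonical reordering.

```cpp
#include <string>

// Constants from the Unicode Standard's Hangul syllable algorithm.
namespace hangul {
constexpr char32_t SBase = 0xAC00;  // first precomposed syllable
constexpr char32_t LBase = 0x1100;  // first leading consonant
constexpr char32_t VBase = 0x1161;  // first vowel
constexpr char32_t TBase = 0x11A7;  // one before the first trailing consonant
constexpr char32_t VCount = 21;
constexpr char32_t TCount = 28;
constexpr char32_t NCount = VCount * TCount;  // 588
constexpr char32_t SCount = 19 * NCount;      // 11172 syllables in total
}

// Hypothetical helper: appends the canonical decomposition of cp to out if cp
// is a precomposed Hangul syllable; any other code point is copied unchanged.
void append_hangul_decomposition(char32_t cp, std::u32string& out) {
    using namespace hangul;
    if (cp < SBase || cp >= SBase + SCount) {
        out.push_back(cp);
        return;
    }
    const char32_t s = cp - SBase;
    out.push_back(static_cast<char32_t>(LBase + s / NCount));
    out.push_back(static_cast<char32_t>(VBase + (s % NCount) / TCount));
    const char32_t t = s % TCount;
    if (t != 0) {  // t == 0 means the syllable has no trailing consonant
        out.push_back(static_cast<char32_t>(TBase + t));
    }
}

// Hypothetical helper: decomposes every Hangul syllable in a UTF-32 string.
std::u32string decompose_hangul_syllables(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size() * 3);  // a syllable expands to at most 3 code points
    for (char32_t cp : in) {
        append_hangul_decomposition(cp, out);
    }
    return out;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, and U+AC00 decomposes to U+1100 U+1161. The rest of the algorithm is table-driven, which is where most of the size of a "real" library comes from.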

Greg Smirnov