Light C Unicode Library

Question

I'm looking for a small C library to handle UTF-8 strings.

Specifically, I need to split strings on Unicode delimiters for use with stemming algorithms.

Related posts have suggested:

ICU http://www.icu-project.org/ (I found it too bulky for my purposes on embedded devices)

UTF8-CPP: http://utfcpp.sourceforge.net/ (Excellent, but C++ not C)

Has anyone found a platform-independent library with a small codebase for handling Unicode strings? (It doesn't need to do naturalisation.)

utf8-cpp is great! It ported smoothly to iOS/Android, and it's a header-only library. – barney May 21 '16 at 15:25

score 38 · Accepted Answer · edited Jan 19 '17 at 13:50


A nice, light library that I have used successfully is utf8proc.

answered Nov 24 '08 at 06:52 by Avi · edited Jan 19 '17 at 13:50 by pmttavara

score 15 · Answer 2 · edited Jun 19 '20 at 23:27


There's also MicroUTF-8, but it may require login credentials to view or download the source.

answered Oct 30 '11 at 12:28 by xenu · edited Jun 19 '20 at 23:27 by Jonathan Leffler

score 13 · Answer 3 · answered Nov 24 '08 at 07:30


UTF-8 is specially designed so that many byte-oriented string functions continue to work or only need minor modifications.

C's strstr function, for instance, will work perfectly as long as both its inputs are valid, null-terminated UTF-8 strings. strcpy works fine as long as its input string starts at a character boundary (for instance the return value of strstr).

So you may not even need a separate library!

answered Nov 24 '08 at 07:30 by Artelius


Very true. Until now I had only needed to store and copy strings, and was doing just that. But then I needed to split and stem words for indexing, so I wanted to make sure I was handling them properly. – Akusete Nov 24 '08 at 07:33

While they work, byte-oriented search functions will probably not perform as well on UTF-8 text as UTF-8-aware ones. For example, if a UTF-8 character can immediately be ruled out as a match (often possible when it is compared with an ASCII character), its entire encoding, which can span multiple bytes, can be skipped. But you're right that some of C's functions work fine with UTF-8 strings, which is one of the reasons UTF-8 is popular. – Ethan Jan 24 '12 at 00:56

Not crashing is not the same as working: something as simple as the string size does not work for UTF-8. UTF-8 is NOT designed especially for library compatibility. – Adrian Maire Jul 03 '17 at 13:59
