
I'm working on a project that processes UTF-8 strings character by character; however, I was unable to find a way to work on UTF-8 strings in that manner in C++.

What I need is:

  • The strings need to be UTF-8, since they won't be limited to the English alphabet.
  • Storing and retrieving them as-is is not enough, since I'll work on them character by character and process them.
  • Accessing them character by character, and being able to compare them with other UTF-8 characters, is a requirement.

Suggestions for any C++ feature or library (regardless of 98/11/14) are very welcome.

Additional points for not using Boost. I have a tendency to develop tools without external dependencies.

bayindirh
  • Have you heard of [ICU](http://site.icu-project.org/design/cpp)? – Eljay Oct 21 '18 at 19:04
  • This answer (and the one it references) should provide what you need: https://stackoverflow.com/questions/37989081/how-to-use-unicode-range-in-c-regex/37990517#37990517 – Galik Oct 21 '18 at 19:06
  • Possible duplicate: https://stackoverflow.com/questions/43302279/any-good-solutions-for-c-string-code-point-and-code-unit – Galik Oct 21 '18 at 19:08
  • Code points aren’t characters. You’ll end up reimplementing half of icu, and badly at that, by attempting to do it yourself for *characters*. If you truly want to iterate characters, then you need icu and that’s that. If you need to iterate code points, you’re asking for code without trying anything and thus the question is off-topic. Show how you tried to decode code points from utf-8 and we can help you fix it should it have bugs :) – Kuba hasn't forgotten Monica Oct 21 '18 at 19:19
  • Standard C++ already has `utf-8` to `utf-16`/`utf-32` converters; no need for an external library. – Galik Oct 21 '18 at 19:20
  • @KubaOber in the C++ `std::string` context every char is *half* of a two-byte utf-8 character, and I used *code point* to point out that I need the complete character that these two bytes encode. I've updated the question. It's just a terminology misuse by me, sorry. – bayindirh Oct 21 '18 at 19:22
  • If you mean code points then don’t use the word “character” anywhere… I still don’t know if you want characters or code points. – Kuba hasn't forgotten Monica Oct 21 '18 at 19:25
  • @Galik, thanks a lot for your comments. At the core, the storage of the text is not the problem; the problem is to access these two-byte characters as single characters during iteration. I need to see these two bytes as single characters, otherwise I cannot process them. – bayindirh Oct 21 '18 at 19:32
  • @KubaOber I want the characters, not the bytes of that particular character. – bayindirh Oct 21 '18 at 19:35
  • @bayindirh That's exactly what my links give you. Here is what you can use: https://stackoverflow.com/a/43302460/3807729 – Galik Oct 21 '18 at 20:22
  • @Galik, I'm sorry for the confusion. I've answered your last comment only. Your other links are highly useful, and thanks for that again. – bayindirh Oct 21 '18 at 20:31
  • You should state an operating system. On Linux you often use [`iconv(3)`](https://linux.die.net/man/3/iconv) for free and open source projects. On Windows you often use the Win32 API. – jww Oct 22 '18 at 01:22
  • Possible duplicate of [Any good solutions for C++ string code point and code unit?](https://stackoverflow.com/questions/43302279/any-good-solutions-for-c-string-code-point-and-code-unit) –  Oct 22 '18 at 01:30

3 Answers


C++ is notorious for having very poor Unicode support out of the box, so the best option is to use a library like ICU or Boost.
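For illustration, here is a minimal sketch of code-point iteration with ICU (my own addition, not part of the original answer). It assumes ICU is installed and linked (e.g. `-licuuc`), and that the source and execution character sets are UTF-8; the sample string is arbitrary:

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <cstdint>
#include <iostream>

int main() {
    // Decode a UTF-8 byte string into ICU's internal representation.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("Grüße, καλημέρα");

    // Walk the string one code point at a time, never splitting a
    // multi-byte sequence.
    for (int32_t i = 0; i < text.length(); i = text.moveIndex32(i, 1)) {
        UChar32 cp = text.char32At(i);
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << '\n';
    }
}
```

Note that this iterates code points; for user-perceived characters (grapheme clusters, e.g. a letter followed by a combining accent), ICU's `BreakIterator` is the usual tool, as the comments above point out.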

Friendly advice:

I have a tendency to develop tools without external dependencies

You need to justify this statement; otherwise, if it's an arbitrary rule of yours, you limit yourself. Libraries, like languages, are tools. Choosing which tools to use needs to be analyzed, and the benefits weighed against the downsides.

bolov
  • Thanks for the advice! I like to use libraries that I can embed into my source tree completely, for various reasons. First of all, it removes the burden of installing development packages of big libraries just for compiling a small utility (of mine); then it removes the burden of code maintenance to keep library compatibility as the library evolves. Lastly, it makes the tool more portable, since I don't always have the luxury of installing lots of dev packages to compile my tool. However, if the best way is to use Boost or another so-called big library, I'll happily use it at the end of the day. – bayindirh Oct 21 '18 at 19:11
  • @bayindirh You can use vcpkg (https://github.com/Microsoft/vcpkg) for building and integrating almost any major C++ library nowadays. It's been getting a lot of traction and it is equally good for both rapid prototyping with usage of third-party libs, and for enterprise scenarios (see their `export` command) – ivanmoskalev Oct 21 '18 at 19:12
  • @bayindirh solid argument. If you know what you are doing, which it looks like you do, you are the only judge who can tell if implementing utf8 support yourself is worth it or not. – bolov Oct 21 '18 at 19:12
  • @bolov, thanks. I'll take a look at ``, `libICU`, and others, and if I can find an embeddable library (like `eigen`, `easylogging++`, etc.) I'll use it without hesitation. This is a personal project, so there is no time pressure. I'll try to strike a healthy balance between challenge, not-invented-here, and pragmatism. – bayindirh Oct 21 '18 at 19:19

You mean working with code points (as opposed to the actual chars – i.e. bytes)? A small addition to the answer above: I would recommend that you first read the spec on how UTF-8 works, then probably read the "UTF-8 Everywhere" manifesto, and also look here – it is a nice example of how to build a UTF-8 code point iterator. It is always good to know how stuff actually works, especially if it is an important part of your software. Though you will most certainly end up using ICU :-)
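To make the "build your own iterator" idea concrete, here is a rough sketch of my own (not the example linked above): it reads the lead byte, masks off the length bits, and folds in the continuation bytes. It assumes the input is already valid UTF-8 and does no error handling:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Return the code point starting at byte s[i] and advance i past it.
// Assumes s holds valid UTF-8; real code needs validation.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    char32_t cp;
    std::size_t len;
    if      (b0 < 0x80) { cp = b0;        len = 1; }  // 0xxxxxxx (ASCII)
    else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }  // 110xxxxx + 1 continuation
    else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }  // 1110xxxx + 2 continuations
    else                { cp = b0 & 0x07; len = 4; }  // 11110xxx + 3 continuations
    for (std::size_t k = 1; k < len; ++k)             // each continuation: 10xxxxxx
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    i += len;
    return cp;
}

int main() {
    const std::string text = "Grüße";  // UTF-8 bytes held in an ordinary std::string
    for (std::size_t i = 0; i < text.size();) {
        const char32_t cp = next_code_point(text, i);
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << '\n';
    }
}
```

The two-byte characters from the question (`ü`, `ß`, …) come out as single `char32_t` values that can be compared directly.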

ivanmoskalev
  • Actually I need to access the characters themselves. The text I'm going to process is guaranteed to have two-byte Unicode characters, and I need to access them without seeing their individual bytes. Since C++ can store and read Unicode strings in `std::string`, by dividing the bytes internally and behaving indifferently to these binary values, I used *code points* to explicitly say that I need to access two-byte characters as characters themselves, not the individual bytes of these two-byte characters. – bayindirh Oct 21 '18 at 19:15
  • Yeah, sorry, I understood what you meant, just expressed my idea not too well. By actual chars I meant `char`s – byte values. Edited the answer. – ivanmoskalev Oct 21 '18 at 19:17

You can use wide chars (or also multibyte characters) for handling Unicode.

https://www.geeksforgeeks.org/wide-char-and-library-functions-in-c/ has a summary of the C++ library functions for wide chars.

Also see the internationalization (i18n) standards, and cf. https://www.cprogramming.com/tutorial/unicode.html
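As a hedged sketch of the standard-library route (my own example, not from the linked pages): `std::wstring_convert` with `std::codecvt_utf8` turns a UTF-8 `std::string` into a `std::u32string` whose elements are whole code points, and back. These facilities are deprecated since C++17, so treat this as a stopgap rather than a long-term design:

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated in C++17, still shipped)
#include <cstdint>
#include <iostream>
#include <locale>    // std::wstring_convert
#include <string>

int main() {
    const std::string utf8 = "Grüße";  // UTF-8 encoded bytes

    // Widen the UTF-8 byte string into UTF-32, one element per code point.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    const std::u32string utf32 = conv.from_bytes(utf8);

    for (char32_t cp : utf32)
        std::cout << "U+" << std::hex << static_cast<std::uint32_t>(cp) << '\n';

    // And back to UTF-8 bytes once processing is done.
    const std::string round_trip = conv.to_bytes(utf32);
    std::cout << round_trip << '\n';
}
```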

ralf htp