String handling with Nordic characters is difficult in C++

Question

I have tried many ways to solve this problem. I just want to split a string or do something with each character. As soon as the string contains Nordic characters, it can no longer be split correctly.

The length() function returns the right answer in terms of memory use, but that is not the same as the number of characters. "ABCÆØÅ" does not have 6 as its length, it has 9: one extra for each special character.

Does anybody have a good answer?

The test below shows the problem: some letters and a lot of ? marks. :-(

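#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

int main()
{
   string name = "some æøå string";
   // Prints one char (one byte) per line. If the source file is saved as
   // UTF-8, each of æ, ø and å is two bytes and gets split in half here.
   for_each(name.begin(), name.end(), [] (char c) {
      cout << c;
      cout << endl;
   });
}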

Pick an encoding (e.g., UTF-8) and find a library that offers APIs for that encoding until standard Unicode support gets fleshed out. — chris, Jul 28 '20 at 18:09
Also explore Unicode, UTF-8 and maybe in case of nordic characters, [ISO-8859-10](https://en.wikipedia.org/wiki/ISO/IEC_8859-10). Further reading: [How to use Unicode in C++?](https://stackoverflow.com/questions/3010739/how-to-use-unicode-in-c) — rustyx, Jul 28 '20 at 18:10
@Jan ENCODING is important. `"ABCÆØÅ"` has 9 `char`s when encoded in [UTF-8](https://en.wikipedia.org/wiki/UTF-8) (`0x41 0x42 0x43 0xC3 0x86 0xC3 0x98 0xC3 0x85`), and 6 `char`s when encoded in [ISO-8859-10](https://en.wikipedia.org/wiki/ISO/IEC_8859-10) (`0x41 0x42 0x43 0xC6 0xD8 0xC5`). You can't treat **multi-byte encodings** as individual `char`s, you lose information that way, which is why you see a bunch of `?` in your display. You have to take encoding into account when parsing text data. If you don't know the encoding, ask the user, don't guess it (you are likely to guess wrong) — Remy Lebeau, Jul 28 '20 at 18:20
Well, encoding has not helped me as far as I have tried. I need to extract each character, and there is no way to do this when length() says 9 but there are only 6 characters. I have no way of knowing whether a character is one or two bytes long. — Jan, Jul 29 '20 at 14:50
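
The lead byte of every UTF-8 sequence identifies itself, which is what makes it possible to tell where a character starts: continuation bytes always have the bit pattern `10xxxxxx`, and all other bytes start a new code point. A minimal sketch of counting code points that way follows. It assumes the string holds valid UTF-8, and `count_code_points` is a name invented for this illustration, not a standard library function.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: `s` holds valid UTF-8. The function name is hypothetical.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // ASCII byte or lead byte: a new code point starts here
        }
    }
    return n;
}
```

For the bytes `0x41 0x42 0x43 0xC3 0x86 0xC3 0x98 0xC3 0x85` quoted above, `size()` is 9 while this function returns 6.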

Pablo Yaggi · Answer 1 · 2020-07-28T18:31:14.017

0

If your terminal supports UTF-8 encoding, there should be no problem using std::cout with the string you enter, but you need to tell the compiler that you typed a UTF-8 string, like this:

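#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

int main()
{
   // Note: since C++20 a u8 literal has type const char8_t[] and no longer
   // initializes a std::string, so build this as C++11/14/17.
   string name = u8"some æøå string";
   for_each(name.begin(), name.end(), [] (char c) {
      cout << c;
      cout << endl;
   });
   cout<<name; //this will also work
   return 0; //add this just to be tidy
}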

You need to do that because characters in UTF-8 might need 1, 2, 3 or 4 bytes, depending on which character it is.

Then, depending on what you need to do (for example, splitting between characters), you should write a function that detects how long each UTF-8 character is. You then create a 'string' for each UTF-8 character and extract as many characters as needed from the original string. There is a very good (and very compact) library, utf8proc, that lets you do such things.
utf8proc has helped me in many projects to resolve this kind of issue.
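
The hand-written approach described above could be sketched roughly as follows. This is only an illustration, not utf8proc's API: `utf8_sequence_length` and `split_utf8` are hypothetical names, and the code assumes the input is valid UTF-8 (it does no validation beyond always advancing by at least one byte).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and also for unexpected bytes, so callers always advance.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Splits a UTF-8 string into one std::string per code point.
std::vector<std::string> split_utf8(const std::string& s)
{
    std::vector<std::string> parts;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        parts.push_back(s.substr(i, len));  // substr clamps at the end of the string
        i += len;
    }
    return parts;
}
```

With this sketch, `split_utf8(u8"some æøå string")` (pre-C++20) yields 15 elements, and the elements at indices 5, 6 and 7 are the two-byte strings for æ, ø and å. Each element is one code point, which is not always one user-perceived character when combining marks are involved.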

edited Jul 28 '20 at 18:31

answered Jul 28 '20 at 18:19

The fact that `"ABCÆØÅ"` is 9 chars in length on the OP's system implies that the cpp file itself is encoded in UTF-8 to begin with, in which case using the `u8` prefix is optional (but generally a good idea to be explicit). The issue here is with the terminal encoding, not the data encoding. The fact that `?` is being displayed means the OP's terminal likely DOES NOT support UTF-8 output. Or, at least, does not support outputting UTF-8 bytes as single chars (which doesn't make sense to do anyway). Try removing the `endl` from the loop to avoid flushing the output after each byte. – Remy Lebeau Jul 28 '20 at 18:22
It sounds like the poster wants to do more than just print it. – chris Jul 28 '20 at 18:22
The poster used a utf-8 string, and he can do anything with that string, even count characters, he doesn't need a std::wstring. – Pablo Yaggi Jul 28 '20 at 18:23
*I just want to part a string or do stuff with each character.* - Neither of these is as simple as the answer makes it sound. Doing something with each "character" requires grouping bytes and whatever is specifically meant by "parting" will definitely require a "character" boundary as well. stdc++ has no such facilities for UTF-8 yet, so it's definitely not as simple as marking the string UTF-8 and being done with it. That's not to say the poster needs a `wstring`, but they need more than a `u8` marker. – chris Jul 28 '20 at 18:29
"...characters in UTF-8..." More correct to say "...codepoints in UTF-8...", since a Unicode character may consist of 1-or-more codepoints because of combining codepoints. The standard has no upper limit as to how many combining codepoints can be used to comprise a character, regardless there are other standards which put an upper "sanity" limit of 32 codepoints which ought to be far more than enough for most legitimate purposes. – Eljay Jul 28 '20 at 19:16
@Eljay, you are correct, I voted up your comment. I was trying to be more illustrative (maybe I did a bad job at it). Do you think I should modify that in the answer, or let the poster read the comments? – Pablo Yaggi Jul 28 '20 at 19:44
I need to split the string so that I can use each character to look up in a vector, in order to translate it to another character or string. It all works fine except that I can't extract the Nordic characters and am therefore unable to do the translation. – Jan Jul 28 '20 at 19:52
I don't need to display or print the character, just extract it from the string. But I must say it was a surprise to me that the length of a string is not the actual number of characters but the number of bytes. I hope that a future compiler can solve this, so that the developer doesn't have to worry about how the compiler stores things in memory, a 6-character string has a length of 6, and string[4] is the fifth character no matter whether it is a Nordic or any other character. – Jan Jul 28 '20 at 20:15
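
One way to get the behaviour asked for in the last comment, where the length is 6 and index 4 is the fifth character, is to decode the UTF-8 bytes into a `std::u32string` holding one `char32_t` per code point. The decoder below is a minimal sketch written for this thread, not a standard facility: `decode_utf8` is a hypothetical name, malformed or truncated sequences are replaced with U+FFFD, and overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into one char32_t per code point.
// Assumption: mostly valid input; bad sequences become U+FFFD.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0xFFFD;  // default for stray or invalid lead bytes
        std::size_t len = 1;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }

        if (i + len > s.size()) {  // sequence cut off at the end of the string
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(0xFFFD);
            i += 1;  // resynchronise on the next byte
        }
    }
    return out;
}
```

For the UTF-8 bytes of "ABCÆØÅ", the result has `size() == 6` and element 4 is U+00D8 (Ø), so each element can be used directly as a key in a translation table. As noted in the comments, one code point is still not always one visible character when combining code points are present.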
