
I am determining the length of certain strings of characters in C++ with the `length()` member function, but noticed something strange: say I define in the main function

string str;
str = "canción";

Then, when I calculate the length of str with str.length(), I get 8 as output. If instead I define str = "cancion" and calculate str's length again, the output is 7. In other words, the accent on the letter 'o' is altering the real length of the string. The same thing happens with other accents: for example, if str = "für" it will tell me its length is 4 instead of 3.

I would like to know how to ignore these accented characters when determining the length of a string; however, I wouldn't want to ignore isolated characters like '. For example, if str = "livin'", the length of str must be 6.

Bill the Lizard
Carl Rojas
    if you are using windows, use `wstring`. I say only for windows because of [this](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) – R Nar Nov 24 '15 at 20:40
  • You are not getting an extra character because the string contains `o'` or something like this, but because the unicode character `ó` consists of two bytes. – Baum mit Augen Nov 24 '15 at 20:40
  • 2
    Welcome to the sad word of text encoding in source literals, text encoding in general, variable-length encodings in particular and maybe unicode normalization if you feel strong enough. First of all, you should specify the encoding you are using for the text in your application, for your source files and how your compiler is set up in that respect. Also, since the C++ standard is severely lacking talking about encodings, knowing what compiler you are using on what platform could be useful. – Matteo Italia Nov 24 '15 at 20:41
  • It sounds like you're using UTF-8 encoding, but it would be best if this was specified in the question itself. Otherwise, answers will include guesses that may not be helpful to future readers. – MrEricSir Nov 24 '15 at 20:47
  • @MrEricSir excuse my ignorance, but how do I know what kind of encoding I am using? – Carl Rojas Nov 24 '15 at 20:49
  • 1
    Why do you need the length in characters? What *is* a character? – n. m. could be an AI Aug 07 '18 at 21:06
  • Are you wanting "length" as in "number of columns in the terminal"? Because if so, you also need to worry about multi-column characters - see `\uff20` and most Asian characters. And even then, not all terminals use the same version of the standard ... – o11c Aug 07 '18 at 21:19
  • Carl, there is no text but encoded text. "How do I know what [which character] encoding I am using?" There are many contexts where this is important. Firstly, you choose when you save your source file. You then have to tell your compiler. Every communication of text involves bytes and an encoding. But, what @n.m., says. Please [edit] your question. – Tom Blodget Aug 07 '18 at 23:49

2 Answers


It is a difficult subject. Your string is likely UTF-8 encoded, and str.length() counts bytes. An ASCII character is encoded in 1 byte, but characters with code points larger than 127 are encoded in more than 1 byte.

Counting Unicode code points may not give you the answer you need either. Instead, you need to take into account the width of each code point, to handle separate combining accents and double-width code points (and there may be other cases as well). So it is difficult to do this properly without using a library.

You may want to check out ICU.

If you have a constrained case and you don't want to use a library for this, you may want to learn the UTF-8 encoding (it is not difficult) and create a simple UTF-8 code point counter (a simple algorithm is to count the bytes where `(b & 0xC0) != 0x80`).

geza
  • "You need to normalize the string first" -- Don't forget that not all combinations of letters and diacritics have precomposed forms, so normalisation won't necessarily help. And when you take character widths into account, I think normalisation no longer matters: combining characters should be treated as having a width of 0. –  Aug 07 '18 at 21:09
  • @hvd: absolutely valid point, I've modified my answer a little bit. – geza Aug 07 '18 at 21:17
  • ICU is famous for being an annoying library to use, w.r.t. versions ... one of my long-term goals is to write a library where you only have to change a *data* file to update the version of the unicode standard you're using – o11c Aug 07 '18 at 21:21
  • @o11c: maybe it is possible only if you can have scripts in that data. – geza Aug 07 '18 at 21:24

Sounds like UTF-8 encoding. Since the characters with the accents cannot be stored in a single byte, they are stored in 2 bytes. See https://en.wikipedia.org/wiki/UTF-8

DBug