How do I correctly process strings in multiple languages?

// C++: how can I determine the number of characters
// in a string in a universal way, regardless of the encoding?
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // the test string is 13 characters long,
    // but the reported length is 19
    string test_string = "string_строка";

    cout << "String length " << test_string.size() << " characters.\n";

    return 0;
}

I think this happens because characters of the Latin alphabet and Cyrillic characters occupy different amounts of memory.
Is there a universal way to solve this, or at least one that works for Cyrillic? My system is Ubuntu 14.04 (Unity). The compiler is GCC 4.9.1 20140922 (Red Hat 4.9.1-10), 64-bit.

Yurii Holskyi
  • A solution that does not know the encoding and its quirks is simply impossible. – deviantfan Apr 30 '16 at 08:44
  • I guess your string is UTF-8 (to check, print the byte values). In that case, first decide whether you want code points or glyphs. – deviantfan Apr 30 '16 at 08:45
  • @bkVnet wstring is just an array of 2- or 4-byte units instead of 1-byte ones. It doesn't change anything about the problem (encoding, Unicode principles, etc.). – deviantfan Apr 30 '16 at 08:46
  • @deviantfan Yes, the string is encoded in UTF-8. – Yurii Holskyi Apr 30 '16 at 08:47
  • @deviantfan One moment, please wait. – Yurii Holskyi Apr 30 '16 at 08:47
  • @bkVnet It works! Thanks! – Yurii Holskyi Apr 30 '16 at 08:49
  • @Did_Mazay Glyphs are what you see. Code points are entries in the Unicode "table". In Unicode, sadly, these are not the same thing, unlike in ASCII etc. – deviantfan Apr 30 '16 at 08:49
  • What works? ... I don't know what you did just now, but if this is more than a small school assignment, you're probably missing something. – deviantfan Apr 30 '16 at 08:49
  • You need to convert UTF-8 to UTF-16 to get the exact length; have a look at http://stackoverflow.com/questions/18921979/how-to-convert-utf-8-encoded-stdstring-to-utf-16-stdstring?rq=1 – piyushj Apr 30 '16 at 08:50
  • @piyushjaiswal For your string this may work, but in the general case it is completely wrong. – deviantfan Apr 30 '16 at 08:51
  • @deviantfan, sorry, my reply was for bkVnet. – Yurii Holskyi Apr 30 '16 at 08:51
  • @Did_Mazay Well, I can only suggest that you learn more about charsets. Otherwise, your program may work in 98% of cases, but not in the rest. – deviantfan Apr 30 '16 at 08:53
  • @deviantfan, no, this is not a school assignment. – Yurii Holskyi Apr 30 '16 at 08:54
  • @Did_Mazay Yeah, that's what I feared. You know, such errors can kill people (and they already have). – deviantfan Apr 30 '16 at 08:56
  • @deviantfan, but now it really works. Is this not a universal solution? – Yurii Holskyi Apr 30 '16 at 08:57
  • @Did_Mazay Did you even read my comments so far? Yes, it will work for this string, and for 98% or 99% of all other strings your program can get, but not for 100%. => It's *not* a universal solution. And the universal solution is much, much more complex. – deviantfan Apr 30 '16 at 08:58
  • @deviantfan, maybe it would be better to delete my question if it is incorrect? – Yurii Holskyi Apr 30 '16 at 09:00
  • @deviantfan I'm sorry, I'm just starting to learn C++. – Yurii Holskyi Apr 30 '16 at 09:02
  • @Did_Mazay Your question is not incorrect. And being a beginner is fine too. But "it works for this input, ok, I'm done" won't go well, either with charsets or with C++. Before you start writing programs that are actually used by other people, here are some keywords that you should understand (really understand): byte/code point/glyph, charset/encoding, UTF-16 surrogates, BOM, endianness, collation, and the most complex but important one: Unicode normalization. – deviantfan Apr 30 '16 at 09:32
  • @deviantfan, OK, thanks for the detailed tip. – Yurii Holskyi Apr 30 '16 at 09:42

0 Answers