I am trimming a long std::string to fit it in a text container using this code.

std::string AppDelegate::getTrimmedStringWithRange(std::string text, int range)
{
    if (text.length() > range)
    {
        std::string str(text,0,range-3);
        return str.append("...");
    }
    return text;
}

But for other languages, such as Hindi ("हिन्दी"), the length of the std::string is wrong.

My question is: how can I retrieve the accurate length of the std::string in all cases?

Thanks

Wez Sie Tato
Haroon
  • `std::string` only supports ASCII. You may want `std::wstring` instead or a similar data structure – AndyG Jul 27 '15 at 11:55
  • Can I change std::string to std::wstring and vice versa? – Haroon Jul 27 '15 at 11:57
  • Yeah, see this: http://stackoverflow.com/questions/2573834/c-convert-string-or-char-to-wstring-or-wchar-t – MKII Jul 27 '15 at 11:59
  • @AndyG: `std::string` does not support any particular encoding. It just stores bytes. It is perfectly capable of storing non-ASCII strings. UTF-8, for example. – Benjamin Lindley Jul 27 '15 at 12:00
  • @AndyG, I don't know the exact length of that string; it looks like "3", and if you use wstring the length is 6. As Benjamin Lindley points out, std::string only stores bytes; the internal representation depends on your settings. – Jose Palma Jul 27 '15 at 12:03
  • @BenjaminLindley: You are correct; I misspoke. – AndyG Jul 27 '15 at 12:15
  • No offence but the suggestion to use `std::wstring` to solve this problem indicates that the issue is more than one of misspeaking. :) – Lightness Races in Orbit Jul 27 '15 at 12:57
  • Yup, it's not a perfect solution, but at least it makes the fault window small :P – Haroon Jul 27 '15 at 13:22
  • Are you sure you want the length of the string in this particular case, or the "width in pixels of the string after it's rendered"? Unless you're using a fixed-width font the latter might be more useful. – Steve Jul 27 '15 at 16:41

3 Answers

Assuming you're using UTF-8, you can convert your string to a simple (hah!) sequence of Unicode code points and count them. I grabbed this example from Rosetta Code.

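#include <codecvt>
#include <iostream>
#include <locale>   // std::wstring_convert
#include <string>

int main()
{
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b"; // U+007a, U+00df, U+6c34, U+1d10b
    std::cout << "Byte length: " << utf8.size() << '\n';
    // Decode UTF-8 into UTF-32 so that each element is one code point.
    // Note: std::wstring_convert and <codecvt> are deprecated since C++17.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}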
Ferruccio
  • That will produce the length of the string in Unicode codepoints, but it will not produce the display size of the string, because some Unicode characters have zero length (particularly combining characters like diacritics) while others have length two on fixed-width consoles. (If the output is not in a fixed-width font, then the situation is quite different, obviously.) In Posix, you could use `wcswidth` but that may or may not work with C++. – rici Jul 27 '15 at 15:22

The length of std::string is not "wrong"; you've simply misunderstood what it means. A std::string stores bytes, not "characters" in your chosen encoding. It gleefully has no knowledge of that layer. As such, the length of std::string is the number of bytes it contains.

To count such "characters", you will need a library that supports analysis of your chosen encoding, whatever that is.

Only if your chosen encoding is a single-byte encoding, such as plain ASCII, can you just count the bytes and be done with it.

Lightness Races in Orbit
  • IIRC, there are multiple correct ways to count "characters" in Unicode. For example, you might want to count "graphemes" instead of code points. So even if you stick with UTF-8, you need to think about what kind of thing you'd like to count. (Again, as far as I know; I really know very little beyond the UTF-8 scheme.) – Aaron McDaid Jul 27 '15 at 13:25

As explained in the comments, length() returns the number of bytes in your string, which is encoded in UTF-8. In this multibyte encoding, non-ASCII characters are encoded in 2 to 4 bytes, so the length of your UTF-8 string will appear larger than the real number of Unicode characters.

Solution 1

If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to find the additional bytes of multibyte characters: they all have the form 10xxxxxx in binary. So count the number of such additional bytes and subtract this from the string length.

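// Assumes <algorithm>, <iostream>, <string> and using namespace std; s is a UTF-8 encoded std::string.
cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;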
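
Applied to the trimming function from the question, the same continuation-byte test can be used to cut the string on a code point boundary instead of in the middle of a multibyte sequence. The following is only a minimal sketch, assuming the input is valid UTF-8; `isContinuationByte`, `utf8Length` and `trimUtf8` are hypothetical helper names, not part of any library, and the limit counts code points rather than rendered glyphs or pixels.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of code points in a valid UTF-8 string:
// every byte that is not a continuation byte starts a new code point.
inline std::size_t utf8Length(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s)
        if (!isContinuationByte(c)) ++n;
    return n;
}

// Hypothetical helper: keep at most maxCodePoints code points,
// replacing the tail with "..." when the text had to be cut.
inline std::string trimUtf8(const std::string& text, std::size_t maxCodePoints)
{
    if (utf8Length(text) <= maxCodePoints)
        return text;

    // Reserve three code points for the ellipsis.
    const std::size_t keep = maxCodePoints > 3 ? maxCodePoints - 3 : 0;

    // Advance i to the byte offset at which code point number `keep` starts.
    std::size_t seen = 0;
    std::size_t i = 0;
    for (; i < text.size(); ++i) {
        if (!isContinuationByte(text[i])) {
            if (seen == keep) break;
            ++seen;
        }
    }
    return text.substr(0, i) + "...";
}
```

With this sketch, `trimUtf8("hello world", 8)` gives `"hello..."`, and a limit of 3 or less on a longer string gives just `"..."`. Combining marks are still counted as separate code points, so the visible width can be smaller than the limit.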

Solution 2

If more processing is needed than just counting the length, you could think of using wstring_convert::from_bytes() in the standard library to convert your string into a wstring. The length of the wstring should be what you expect.

// Needs <codecvt>, <locale> and <string>; s is a UTF-8 encoded std::string.
wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;

Attention: wstring on Linux is based on a 32-bit wchar_t, and one such wide char can hold any character of the Unicode character set, so this works well. On Windows, however, wchar_t is only 16 bits, so some characters might still require multi-word encoding. Fortunately, all the Hindi characters are in the range U+0000 to U+D7FF, which can be encoded in one 16-bit word, so it should be OK there as well.
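
Where 16-bit code units are involved, characters outside the Basic Multilingual Plane occupy a surrogate pair, so the number of units over-reports the number of code points. A minimal sketch of how that could be corrected, assuming well-formed UTF-16 held in a `std::u16string`; `utf16CodePoints` is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-16 by skipping low (trailing)
// surrogates, which lie in the range 0xDC00..0xDFFF.
inline std::size_t utf16CodePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s)
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    return n;
}
```

For Hindi text this returns the same value as the number of 16-bit units, since none of its characters need a surrogate pair.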

Christophe
  • How do you know it's UTF-8? – Lightness Races in Orbit Jul 27 '15 at 12:58
  • @LightnessRacesinOrbit Good question. Best guess: it's labelled cocos2d-x, and the [supported platforms](http://www.cocos2d-x.org/wiki/Cocos2d-x) are all Unicode or UCS16 compliant. In such a context, storing Unicode in UTF-8 encoding seemed more probable than other multibyte encodings. – Christophe Jul 27 '15 at 13:29