Find length of std::wstring

Question

How can I determine the length (number of characters) of a std::wstring?

Using myStr.length() gives the byte size (I think), but it's not the number of characters. Do I need to write my own function to count the characters, or is there a native C++ or WinAPI way?

http://en.cppreference.com/w/cpp/string/basic_string/size : Returns the number of characters in the string — billz, Feb 21 '13 at 02:50
See this question: http://stackoverflow.com/questions/4183736/stdwstring-length — Robert Horvick, Feb 21 '13 at 02:50
"*Using `myStr.length()` gives the byte size(I think) but its not the number of characters.*" Why do you think this? — ildjarn, Feb 21 '13 at 02:51
@billz is `myStr.size()` the answer or are there some caveats? If it's the answer then write an answer so I can accept it — sazr, Feb 21 '13 at 02:53
@JakeM : `myStr.length()` is the answer – your question is a bit misguided. — ildjarn, Feb 21 '13 at 02:54
@All `std::wstring::size() returns the number of wide-char elements in the string. This is not the same as the number of characters (as you correctly noticed). Unfortunately, the std::basic_string template (and thus its instantiations, such as std::string and std::wstring) is encoding-agnostic. In this sense, it is actually just a template for a string of bytes and not a string of characters.` Therefore `.size()` won't give me the number of characters. — sazr, Feb 21 '13 at 02:54
@JakeM: Define "character"? Unicode does not define the concept of a character. It defines "codepoints", "grapheme clusters" and the like, but not "character." — Nicol Bolas, Feb 21 '13 at 03:00
@NicolBolas when I say character I just mean in the generic sense where "abc" has 3 characters — sazr, Feb 21 '13 at 03:03
@JakeM: You're in *Unicode's* world now; there is no "generic sense". There is what is defined. And what is not. And "character" is not defined. — Nicol Bolas, Feb 21 '13 at 03:04
@NicolBolas I'd be careful with that statement.. The Unicode Standard uses the word _character_ all over the place (e.g. table 2-1: _The Unicode Standard encodes characters, not glyphs_). — jogojapan, Feb 21 '13 at 03:05
@jogojapan: It also defines very clearly what it means by that. Specifically, it means "code point". — Nicol Bolas, Feb 21 '13 at 03:09
@All my whole reason for this is to convert a wstring to lowercase. So if 'character' is not defined, how do people convert wstrings to lowercase (for English/Anglo-Saxon languages)? In my case I was just going to iterate over the wstring and change each character to lowercase. — sazr, Feb 21 '13 at 03:11
@JakeM The `tolower()` function can be used for this. Have a look at the example given on cppreference: http://en.cppreference.com/w/cpp/locale/ctype/tolower. — jogojapan, Feb 21 '13 at 03:16
There is no such thing as a 'character'. Do you talk about grapheme clusters? Code points? Code units? WHAT are you trying to do, why do you need length in 'characters'? See utf8everywhere.org for some insights.. — Pavel Radzivilovsky, Feb 24 '13 at 22:07

score 4 · Accepted Answer · edited May 23 '17 at 12:04

std::wstring::length() will give you the number of characters, where a character is defined as the atomic unit of the wstring object, i.e. a wchar_t. This is what the Standard means when it refers to characters (see this post for some more details on the use of the word in the Standard).
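To make the distinction between element count and byte size concrete, here is a minimal sketch. The helper names `element_count`, `byte_count` and `demo` are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of wchar_t elements; length() and size() both report this.
std::size_t element_count(const std::wstring& s) {
    return s.length();
}

// Storage taken by those elements in bytes (terminator not included).
std::size_t byte_count(const std::wstring& s) {
    return s.length() * sizeof(wchar_t);
}

// Hypothetical demo: call it from main() to check the claims above.
void demo() {
    const std::wstring s = L"abc";
    assert(s.length() == 3);         // three wchar_t elements, not bytes
    assert(s.length() == s.size());  // the two members are synonyms
    // The byte size depends on the platform's wchar_t width
    // (typically 2 bytes on Windows, 4 on Linux).
    assert(byte_count(s) == 3 * sizeof(wchar_t));
}
```

So length() never reports bytes; it reports wchar_t elements, which only differs from the intuitive character count when the text contains code points or combining sequences that need more than one element.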

However, when it comes to Unicode characters, whether one wchar_t corresponds to one Unicode character depends on the encoding used inside the wstring. If UTF-16 is used, which is often (but not necessarily) the case, one wchar_t corresponds to one Unicode character only within the Basic Multilingual Plane (i.e. all character sets derived from ISO-8859 as well as most of the commonly used CJK characters, but not some of the more exotic ones, e.g. classical Chinese characters)^(*). Characters outside that plane take two wchar_t each (a surrogate pair). If you want to get the character count right for all Unicode characters in that case, you need to use a Unicode-aware library (e.g. ICU), or code it yourself.

^(*) There are additional problems if combining characters are used, as @一二三 points out correctly. Counting those correctly is also best done using appropriate libraries.

answered Feb 21 '13 at 03:00

jogojapan

"*If UTF-16 is used, which is commonly*" If by "commonly", you mean "on Windows". – Nicol Bolas Feb 21 '13 at 03:01
@NicolBolas I'll change it to _often_ :) – jogojapan Feb 21 '13 at 03:04
Even within the BMP for UTF-16, combining forms and presentation forms may make one "character" appear as two `wchar`s (and vice versa). – 一二三 Feb 21 '13 at 07:06
@一二三 Very true! I've added a footnote for this. – jogojapan Feb 21 '13 at 07:09

score 3 · Answer 2 · answered Feb 21 '13 at 03:01

If you want to know the length in wchar_t elements, use myStr.length(). If you want to know the length in Unicode code points, you'll have to find a library that knows how to count those. You could also write it yourself: the rules for determining whether a code point encoded as UTF-16 uses one or two elements are not too hard; see http://en.wikipedia.org/wiki/Utf-16. To check whether your wchar_t is 16 bits (rather than 32 bits), test sizeof(wchar_t) == 2.
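A rough sketch of the do-it-yourself route might look like the following. The function name `count_code_points` is hypothetical, and the code assumes that a 16-bit wchar_t holds UTF-16 and a wider wchar_t holds one code point per element (UTF-32). Neither is guaranteed by the Standard.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a wide string (illustrative helper).
// Assumption: 16-bit wchar_t means UTF-16, anything wider means UTF-32.
std::size_t count_code_points(const std::wstring& s) {
    if (sizeof(wchar_t) != 2) {
        return s.length();  // one element per code point
    }
    std::size_t count = 0;
    const std::size_t n = s.length();
    for (std::size_t i = 0; i < n; ++i) {
        ++count;
        const unsigned long unit = static_cast<unsigned long>(s[i]) & 0xFFFFul;
        const bool is_high = unit >= 0xD800ul && unit <= 0xDBFFul;
        if (is_high && i + 1 < n) {
            const unsigned long next =
                static_cast<unsigned long>(s[i + 1]) & 0xFFFFul;
            // A high surrogate followed by a low surrogate is one code point,
            // so step over the second half of the pair.
            if (next >= 0xDC00ul && next <= 0xDFFFul) {
                ++i;
            }
        }
    }
    return count;
}
```

For example, `count_code_points(L"a\U0001F600b")` yields 3 whether the compiler stores the middle character as one element or as a surrogate pair. Unpaired surrogates are counted as one each, and combining sequences are not merged, so this counts code points rather than user-perceived characters.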
