
How can I determine the length (number of characters) of a std::wstring?

Using myStr.length() gives the byte size (I think), but it's not the number of characters. Do I need to write my own function to find the number of characters, or is there a native C++ or WinAPI way?

sazr
  • http://en.cppreference.com/w/cpp/string/basic_string/size : Returns the number of characters in the string – billz Feb 21 '13 at 02:50
  • See this question: http://stackoverflow.com/questions/4183736/stdwstring-length – Robert Horvick Feb 21 '13 at 02:50
  • "*Using `myStr.length()` gives the byte size(I think) but its not the number of characters.*" Why do you think this? – ildjarn Feb 21 '13 at 02:51
  • @billz is `myStr.size()` the answer or are there some caveats? If it's the answer then write an answer so I can accept it – sazr Feb 21 '13 at 02:53
  • @JakeM : `myStr.length()` is the answer – your question is a bit misguided. – ildjarn Feb 21 '13 at 02:54
  • @All `std::wstring::size() returns the number of wide-char elements in the string. This is not the same as the number of characters (as you correctly noticed). Unfortunately, the std::basic_string template (and thus its instantiations, such as std::string and std::wstring) is encoding-agnostic. In this sense, it is actually just a template for a string of bytes and not a string of characters.` Therefore `.size()` won't give me the number of characters. – sazr Feb 21 '13 at 02:54
  • @JakeM: Define "character"? Unicode does not define the concept of a character. It defines "codepoints", "grapheme clusters" and the like, but not "character." – Nicol Bolas Feb 21 '13 at 03:00
  • @NicolBolas when I say character I just mean in the generic sense where "abc" has 3 characters – sazr Feb 21 '13 at 03:03
  • @JakeM: You're in *Unicode's* world now; there is no "generic sense". There is what is defined. And what is not. And "character" is not defined. – Nicol Bolas Feb 21 '13 at 03:04
  • @NicolBolas I'd be careful with that statement.. The Unicode Standard uses the word _character_ all over the place (e.g. table 2-1: _The Unicode Standard encodes characters, not glyphs_). – jogojapan Feb 21 '13 at 03:05
  • @jogojapan: It also defines very clearly what it means by that. Specifically, it means "code point". – Nicol Bolas Feb 21 '13 at 03:09
  • @All my whole reason for this is to convert a wstring to lowercase. So if 'character' is not defined, how do people convert wstrings to lowercase (for English and similar languages)? In my case I was just going to iterate over the wstring and change each character to lowercase. – sazr Feb 21 '13 at 03:11
  • @JakeM The `tolower()` function can be used for this. Have a look at the example given on cppreference: http://en.cppreference.com/w/cpp/locale/ctype/tolower. – jogojapan Feb 21 '13 at 03:16
  • There is no such thing as a 'character'. Do you talk about grapheme clusters? Code points? Code units? WHAT are you trying to do, why do you need length in 'characters'? See utf8everywhere.org for some insights.. – Pavel Radzivilovsky Feb 24 '13 at 22:07

2 Answers

std::wstring::length() gives you the number of characters, where a character is defined as the atomic unit of the wstring object, i.e. a wchar_t. This is what the Standard means when it refers to characters (see this post for some more details on the use of the word in the Standard).
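To make the distinction between element count and byte size concrete, here is a minimal sketch. The helper names `element_count` and `byte_count` are made up for illustration and are not part of the standard library; only `length()` and `sizeof(wchar_t)` are.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of wchar_t elements in the string.
// This is exactly what length() (and size()) report.
std::size_t element_count(const std::wstring& s) {
    return s.length();
}

// Hypothetical helper: storage taken by those elements in bytes,
// not counting the terminating null. This is NOT what length() returns.
std::size_t byte_count(const std::wstring& s) {
    return s.length() * sizeof(wchar_t);
}
```

For `L"abc"`, `element_count` returns 3 on any platform, while `byte_count` depends on the platform's `wchar_t` width (typically 6 on Windows and 12 on Linux).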

However, when it comes to Unicode characters, whether one wchar_t corresponds to one Unicode character depends on the encoding used inside the wstring. If UTF-16 is used, which is often (but not necessarily) the case, one wchar_t corresponds to one Unicode character only within the Basic Multilingual Plane (i.e. the characters covered by the ISO-8859 character sets as well as most of the commonly used CJK characters, but not some of the more exotic ones, e.g. classical Chinese characters)(*). Characters outside that plane take two wchar_t elements (a surrogate pair). If you want to get the character count right for all Unicode characters in that case, you need to use a Unicode-aware library (e.g. ICU), or code it yourself.

(*)There are additional problems if combining characters are used, as @一二三 points out correctly. Counting those correctly is also best done using appropriate libraries.

jogojapan

If you want to know the length in wchar_t elements, use myStr.length(). If you want to know the length in Unicode code points, you'll have to find a library that knows how to count those. You could also write one yourself - the rules for determining whether a code point encoded as UTF-16 uses one or two elements are not too hard, see http://en.wikipedia.org/wiki/Utf-16. To find out whether your wchar_t is 16 bits (vs. 32 bits), check whether sizeof(wchar_t) == 2.
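As a rough sketch of the do-it-yourself route, the function below counts code points. The name `count_code_points` is hypothetical, and the code assumes that a 16-bit wchar_t holds UTF-16 and a wider wchar_t holds UTF-32, which is common but not guaranteed by the Standard. It counts code points only; combining sequences still count as several.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a wstring.
// Assumption: 16-bit wchar_t means UTF-16, anything wider means UTF-32.
std::size_t count_code_points(const std::wstring& s) {
    if (sizeof(wchar_t) != 2) {
        return s.length();  // UTF-32: one element per code point
    }
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.length(); ++i) {
        const unsigned long unit = static_cast<unsigned long>(s[i]) & 0xFFFFul;
        // A high surrogate (0xD800..0xDBFF) followed by a low surrogate
        // (0xDC00..0xDFFF) encodes a single code point in two elements.
        if (unit >= 0xD800ul && unit <= 0xDBFFul && i + 1 < s.length()) {
            const unsigned long next =
                static_cast<unsigned long>(s[i + 1]) & 0xFFFFul;
            if (next >= 0xDC00ul && next <= 0xDFFFul) {
                ++i;  // skip the low half of the pair
            }
        }
        ++count;  // unpaired surrogates are counted as one each
    }
    return count;
}
```

With this, `count_code_points(L"a\U0001F600b")` yields 3 whether wchar_t is 16 or 32 bits wide, whereas `length()` on the same string reports 4 under UTF-16.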

Mark Ransom