Consider

#include <string>
#include <iostream>

int main()
{ 
    std::string test="αλφα";
    std::cout << "size() of '" << test << "' = " << test.size() << std::endl;
}

which produces

size() of 'αλφα' = 8

How can I use the C++ standard library to find the width of the output that will be produced by writing a string (i.e. 4 in the example above)?

Walter
  • Does https://stackoverflow.com/a/18850689/5470596 answer your question? – YSC Jan 24 '19 at 16:15
  • I'm not sure about the dup. OP might want a generic, not UTF-8-only answer. – YSC Jan 24 '19 at 16:20
  • Interesting reading http://utf8everywhere.org/#myth.strlen – Jarod42 Jan 24 '19 at 16:26
  • @YSC Agreed, nothing useful in the standard library, so roll your own simple decoder. Only the OP can tell us if their code is unicode utf-8 or a specific MBCS, but I would recommend using utf-8 if you have a choice as it is "everywhere" – Gem Taylor Jan 24 '19 at 17:05

1 Answer

The problem here is related to the encoding associated with the string.

This looks like UTF-8 encoding to me (the first character is not the lower case 'a'). In that encoding, the characters you present take two bytes each, which accounts for the size() of 8.
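The two-bytes-per-character observation suggests a simple way to get 4 instead of 8: count only the bytes that start a character. The following is a minimal sketch, assuming the string holds valid UTF-8; the function name `utf8_code_points` is made up for illustration and is not part of the standard library. It counts Unicode code points, which matches the printed width for text like this but not in general (combining marks and double-width characters break the equivalence).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string ASSUMED to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does not match that pattern begins a new code point.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

Called on the `test` string from the question, `utf8_code_points(test)` would give 4, provided the source file is saved as UTF-8.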

UTF-8 encoding is broadly supported by the C++11 standard (rather elegantly, UTF-8 text contains no zero bytes other than an actual NUL character, cf. Windows Unicode). You can use std::string, although size() counts bytes and so will, in general, overstate the number of characters. Care must also be taken when creating string literals of that type directly in your editor.
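One way to sidestep the editor issue is to spell out the bytes with hex escapes, so the literal no longer depends on how the source file is saved. This is only a sketch: the helper name `alpha_utf8` is hypothetical, and the byte values are the UTF-8 encoding of "αλφα" (ce b1 ce bb cf 86 ce b1).

```cpp
#include <string>

// Hypothetical helper: builds "αλφα" from explicit UTF-8 bytes, so the
// result is the same whatever encoding the editor or compiler assumes
// for the source file.
std::string alpha_utf8()
{
    return "\xCE\xB1\xCE\xBB\xCF\x86\xCE\xB1";
}
```

The resulting string has a size() of 8, the same as the literal in the question when the file is saved as UTF-8.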

More reading from here: How to use Unicode (UTF-8) in C++

Bathsheba
  • On Windows, `std::wstring` can help. See https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – YSC Jan 24 '19 at 16:17
  • When you say it is "broadly supported" do you mean there is a wide range of support, or that C++ acknowledges it exists and can work with it (kind of)? – NathanOliver Jan 24 '19 at 16:18
  • @YSC: Indeed it can, although I'd recommend, on balance, using UTF-8. – Bathsheba Jan 24 '19 at 16:19
  • I think it's UTF-8 -- there is no lower case 'a', but two 'α' (Greek alpha). Also, this is on a Mac, definitely no micro software. – Walter Jan 24 '19 at 16:24
  • I'm not sure why the UTF-8 representation for the string here would be 7 bytes large, as I only see small greek alpha characters, and no lower case 'a'. Here, I obtained: "αλφα" -> "ce b1 ce bb cf 86 ce b1" – SirDarius Jan 24 '19 at 16:24
  • @SirDarius: Oops yes you are correct, I've changed my opinion. – Bathsheba Jan 24 '19 at 16:27
  • For the sake of comparing apples to apples, what Microsoft calls "Unicode" is either UTF-16 or UCS-2 depending on the age and quality of the software you're considering ;) – Quentin Jan 24 '19 at 18:01
  • @Quentin: Either way it's full of damn NUL characters! – Bathsheba Jan 24 '19 at 18:01