The following string has size 4, not 3 as I would have expected.

std::string s = "\r\n½"; 
int ss = s.size(); //ss is 4

When I loop through the string character by character, escaping each one to hex, I get:

  • 0x0D (hex code for carriage return)
  • 0x0A (hex code for line feed)
  • 0xc2 (hex code, but what is this?)
  • 0xbd (hex code for the ½ character)

Where does the 0xc2 come from? Is it some sort of encoding information? I thought std::string had one char per visible character in the string. Can someone confirm that 0xc2 is a "character set modifier"?

Johan
  • Your file was encoded as UTF-8, and your compiler reads it as UTF-8, so ½ was converted to its UTF-8 representation, the two bytes 0xC2 0xBD. C++ does not care about the original encoding; what it cares about is what is in memory when `s` is constructed. What was in memory was 0D 0A C2 BD, so `s` was constructed with 4 bytes (4 `char` in your case), which is why you get this result. You would get different results if the file had been saved in a different encoding, or if `char` had been 2 bytes long. – Holt Jun 14 '18 at 12:46
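
To check the bytes independently of how the source file is saved, the literal can be written with explicit byte escapes instead of a pasted ½. This is only a minimal sketch: `hex_dump` is a hypothetical helper name that does not appear in the question, and the escapes `\xC2\xBD` assume the UTF-8 encoding of ½ described in the comment above.

```cpp
#include <string>

// Hypothetical helper: returns the bytes of `s` as space-separated
// uppercase hex pairs, e.g. "0D 0A".
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    // Iterate as unsigned char so bytes >= 0x80 are not sign-extended.
    for (unsigned char byte : s) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

With this helper, `hex_dump("\r\n\xC2\xBD")` yields `0D 0A C2 BD`, the same four bytes the question reports.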

1 Answer

"½" is the Unicode code point U+00BD, which UTF-8 encodes as the two-byte sequence 0xC2 0xBD. This means your string contains only three characters but is four bytes long: `std::string::size()` counts `char` elements (bytes), not characters.
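
If what you want is the number of characters (code points) rather than bytes, one option is to skip UTF-8 continuation bytes while counting. The following is a sketch, not a full decoder: `count_code_points` is a hypothetical name, and it assumes the string already holds valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to hold valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx; every other byte starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string from the question, written as `"\r\n\xC2\xBD"`, `size()` is 4 while `count_code_points` returns 3. Note that this counts code points, which is still not the same as user-perceived characters when combining marks are involved.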

See https://www.fileformat.info/info/unicode/char/00bd/index.htm

Additional reading on SO: std::wstring VS std::string.

YSC