5

I have encountered an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?

U+2008A is a Han character outside the Basic Multilingual Plane, so it should be represented by a surrogate pair in UTF-16. This means, if I understand it correctly, that it should be represented by two wchar_t characters. So I expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).
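As a rough illustration of the surrogate-pair arithmetic referred to above, here is a minimal sketch. The helper name `to_surrogates` is hypothetical (it is not part of the question's code), and it assumes the input is a code point in the range U+10000..U+10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 (high, low) surrogate pair. No range checking is done.
inline std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    // Subtracting 0x10000 leaves a 20-bit value.
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;
    // The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    const char16_t high = static_cast<char16_t>(0xD800u + (v >> 10));
    const char16_t low  = static_cast<char16_t>(0xDC00u + (v & 0x3FFu));
    return std::make_pair(high, low);
}
```

For U+2008A this gives the pair 0xD840, 0xDC8A, which is why a UTF-16 string holding this one character needs two code units plus the terminator.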

So why is sizeof(s2) == 4? I tested that the s2 string has been constructed correctly, because I've rendered it with DirectWrite, and the Han character was displayed correctly.

UPDATE: As Naveen pointed out, I tried to determine the size of the arrays incorrectly. The following code produces the correct result:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

std::wstring str1 (s1);
std::wstring str2 (s2);

int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t characters are needed for the surrogate pair.
Mark Vincze

3 Answers

10

sizeof(s2) returns the number of bytes required to store the pointer s2 itself, which is 4 bytes on your system, the same as for s1. It has nothing to do with the character(s) stored in the memory pointed to by s2.
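To make the distinction concrete, here is a small sketch contrasting the size of a pointer, the size of an array, and the length of the string. The helper names (`pointer_bytes`, `array_bytes`, `code_units`) are made up for this illustration, and the concrete values depend on the platform's pointer size and `sizeof(wchar_t)`.

```cpp
#include <cstddef>
#include <cwchar>

// Size in bytes of the pointer object itself. It is the same for every
// const wchar_t*, no matter how long the string it points to is.
inline std::size_t pointer_bytes(const wchar_t* p) { return sizeof(p); }

// Size in bytes of a wchar_t array, including the terminating L'\0'.
// This only works on a real array (such as a string literal), not on a pointer.
template <std::size_t N>
constexpr std::size_t array_bytes(const wchar_t (&)[N]) { return N * sizeof(wchar_t); }

// Number of wchar_t code units before the terminator.
inline std::size_t code_units(const wchar_t* p) { return std::wcslen(p); }
```

Calling `array_bytes(L"\U0002008A")` directly on the literal is what gives the 6 bytes the question expected on Windows; once the literal has decayed to a `const wchar_t*`, only `code_units` can recover the length.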

Keith Thompson
Naveen
  • 1
    "It has nothing to do with the character stored in s2" -- Since the question was caused by misunderstanding between pointers and the things pointed to, you should avoid causing another misunderstanding like that. There is no character stored in s2. In this case there is a character stored in s2[0] and s2[1]. If it weren't a surrogate pair then there would be a character stored in s2[0] alone, i.e. in *s2. – Windows programmer Jul 17 '12 at 02:09
5

sizeof(wchar_t*) is the same as sizeof(void*), in other words the size of a pointer itself. That will typically be 4 on a 32-bit system, and 8 on a 64-bit system. You need to use wcslen() or lstrlenW() instead of sizeof():

const wchar_t* s1 = L"a"; 
const wchar_t* s2 = L"\U0002008A"; // The "Han" character 

int i1 = sizeof(wchar_t); // i1 == 2
int i2 = wcslen(s1); // i2 == 1
int i3 = wcslen(s2); // i3 == 2
Remy Lebeau
  • "sizeof(wchar_t*) is the same as sizeof(void*)" -- That's not my understanding. sizeof(char*), sizeof(signed char*), and sizeof(unsigned char*) are all the same size as sizeof(void*). sizeof(wchar_t*) and sizeof(other random stuff) can be smaller than sizeof(void*) depending on the implementation. – Windows programmer Jul 17 '12 at 02:12
  • @Windowsprogrammer: Correct -- though the vast majority of modern compilers make all pointer types the same size. – Keith Thompson Jul 17 '12 at 02:27
  • Why would any compiler, let alone the C/C++ standards, allow any `sizeof(any pointer type)` to be smaller than `sizeof(void*)`? From sizeof()'s perspective, a pointer is a pointer is a pointer, it doesn't matter its data type. – Remy Lebeau Jul 17 '12 at 15:47
  • @Remy Lebeau It _is_ a bad idea for modern C/C++ to allow different size types for pointers. But to give an example of potential use: in an embedded world (some using Harvard Architecture) the program and constant data might exist in one address space (ROM) and non-constant data in another (RAM). The ROM size might be MBytes, the RAM 64k. Thus pointers to functions and const data are wider than 16 bits, while pointers to changeable data are 16 bits. PICs I used play other games to keep it all 16 bits, but thought I'd pass on an idea on how type may matter. – chux - Reinstate Monica May 31 '13 at 18:35
  • When memory was at a premium, if you could use fewer bytes to store a particular type of pointer, that was considered an advantage. One setup permitted by the standard would be for a 16-bit `int *` to address 65535 `int`s (although I don't know if anyone actually did that) – M.M Aug 19 '14 at 20:18
0

Addendum to the answers.
RE: to unravel the different units used in the question's update by i1 and i2, i3.

i1 value of 2 is the size in bytes
i2 value of 1 is the size in wchar_t, IOW 2 bytes (given sizeof(wchar_t) is 2, as i1 shows).
i3 value of 2 is the size in wchar_t, IOW 4 bytes
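The difference between the two units can be sketched as follows. The function names are hypothetical; the point is only that the byte count is the `wchar_t` count multiplied by `sizeof(wchar_t)`, whatever that is on the platform in use.

```cpp
#include <cstddef>
#include <string>

// Number of wchar_t code units, which is what std::wstring::size() and wcslen() report.
inline std::size_t size_in_units(const std::wstring& s) { return s.size(); }

// Number of bytes occupied by those code units, excluding the terminating L'\0'.
inline std::size_t size_in_bytes(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}
```

For example, `size_in_bytes(str2)` on the question's `str2` is `str2.size() * sizeof(wchar_t)`, so the result changes with the platform's `wchar_t` width even though the string literal in the source is the same.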

chux - Reinstate Monica