12

I just don't understand, and can't find much info about, how a wchar string ends.

If it ends with a single null byte, how does it know it's not the end of the string yet when something like "009A" represents one of the Unicode symbols?

Or does it end with two null bytes? I'm not sure about it and need confirmation.

Keith Thompson
Kosmo零
  • In C++. I didn't know wchar existed anywhere else. – Kosmo零 Sep 06 '12 at 18:10
  • Somewhat related: [Making a WCHAR null terminated](http://stackoverflow.com/questions/1806297/making-a-wchar-null-terminated). Might be some hints in there as to how to approach this. – j.w.r Sep 06 '12 at 18:10
  • In C++, `wchar_t` (not `wchar`) is a predefined type. In C, `wchar_t` is a typedef defined in `<stddef.h>`. In both cases, the size is implementation-defined; on my system its size is 4 bytes (32 bits). – Keith Thompson Sep 06 '12 at 18:59

4 Answers

13

Since a wide string is an array of wide characters, it can't end in a one-byte NUL. It ends in a null wide character, which is as wide as every other element: two bytes if wchar_t is 16 bits. (Arrays in C/C++ can only hold members of the same type, and therefore of the same size.)

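Also, for standard ASCII characters, each wide character always contains one or three zero bytes (depending on whether wchar_t is 16 or 32 bits wide), because only characters above U+00FF have a non-zero high-order byte. For simplicity, I assume 16-bit and little-endian, so the zero byte comes second in each pair:

HELLO is 72 00 69 00 76 00 76 00 79 00 00 00 (byte values in decimal; in hex that is 48 00 45 00 4C 00 4C 00 4F 00 00 00)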
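The layout described above can be checked on your own machine. Here is a minimal sketch in C; `dump_wide_bytes` is a hypothetical helper name, not a library function, and the output depends on your platform's `wchar_t` size and byte order.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Print every byte of a wide string, terminator included, as two-digit hex.
   Returns the number of bytes printed. (Hypothetical helper.) */
size_t dump_wide_bytes(const wchar_t *s)
{
    /* +1 so the terminating null wide character is dumped too */
    size_t nbytes = (wcslen(s) + 1) * sizeof(wchar_t);
    const unsigned char *p = (const unsigned char *)s;

    for (size_t i = 0; i < nbytes; i++)
        printf("%02X ", (unsigned)p[i]);
    printf("\n");
    return nbytes;
}
```

Calling `dump_wide_bytes(L"HELLO")` with a 16-bit little-endian `wchar_t` should show each letter followed by one zero byte and then two zero bytes for the terminator; with a 32-bit `wchar_t` every group, the terminator included, is four bytes wide.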
  • Err, so if I access an array of wchar like this: arr[0] = 0; will it set both the first and second byte to zero automatically? – Kosmo零 Sep 06 '12 at 18:14
  • @Kosmos (If this is not yet clear, I suggest you read a good tutorial on C pointers and arrays!) –  Sep 06 '12 at 18:16
  • Is there any way that wchar can be converted to char? I am reversing a Chinese app, but as far as I can see they are using char* for text manipulation. Could it be just a wchar array converted to a char* of double the size? – Kosmo零 Sep 06 '12 at 18:21
  • @Kosmos There are libraries with which you can convert UTF-16 (wide strings) to UTF-8. –  Sep 06 '12 at 18:25
  • @H2CO3: On my system, `sizeof (wchar_t) == 4`. You also seem to be making assumptions about endianness. – Keith Thompson Sep 06 '12 at 18:59
  • @KeithThompson yup, that sizeof is perfectly fine. And no, I am not making assumptions about endianness - whether it be little or big endian, it's easier to conceive the essentials if I write all this using big endian notation... –  Sep 06 '12 at 19:12
  • I am trying to scan a Chinese exe for text strings; for that I need to know how many null bytes are at the end: two or four. – Kosmo零 Sep 06 '12 at 19:17
  • @H2CO3: "only extended characters *start* by a non-zero *first* byte" -- that assumes big-endian (with your recent edit, you've made the assumption explicit). – Keith Thompson Sep 06 '12 at 19:17
  • @KeithThompson yes, sorry, you're correct - modern processor architectures that count use the counterintuitive little-endian notation, so that's why I was confusing them... –  Sep 06 '12 at 19:20
  • Since this question is about the double-byte null at the end of the string, it's very strange that your sample string doesn't demonstrate that. – Mooing Duck Sep 06 '12 at 19:37
  • HELLO is 72 00 69 00 76 00 76 00 79 00 in little-endian byte order. The "end" in "endian" actually means the "front end" of the sequence: "In big-endian format, the most significant byte is stored first (has the lowest address) or sent first, then the following bytes are stored or sent in decreasing significance order, with the least significant byte stored last (having the highest address) or sent last." https://en.wikipedia.org/wiki/Endianness – jcsahnwaldt Reinstate Monica Jan 26 '18 at 23:25
5

Here you can read a bit more about wide characters: http://en.wikipedia.org/wiki/Wide_character#Size_of_a_wide_character

The terminator is L'\0', a null wide character. With a 16-bit wchar_t that is a 16-bit null, so it looks like two 8-bit null chars.

Remember that "009A" is a single wchar whose value is non-zero, so it is not a null wchar.
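To illustrate the "009A" point, here is a small sketch in C. `wide_length` and `sample` are hypothetical names made up for this example; the loop follows the same rule as the standard `wcslen`, comparing whole `wchar_t` values rather than individual bytes.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts wide characters up to the first null wide
   character, the same rule wcslen() follows. */
size_t wide_length(const wchar_t *s)
{
    size_t n = 0;
    while (s[n] != L'\0')   /* compares a whole wchar_t, not a single byte */
        n++;
    return n;
}

/* 0x009A has a zero high-order byte, but as a wchar_t it is non-zero,
   so it does not end the string. */
static const wchar_t sample[] = { L'A', (wchar_t)0x009A, L'B', L'\0' };
```

Both `wide_length(sample)` and `wcslen(sample)` count three wide characters, because the zero byte inside 0x009A is never examined on its own.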

Jorge Fuentes González
5

In C (quoting the N1570 draft, section 7.1.1):

A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character.

where a "wide character" is a value of type wchar_t, which is defined in <stddef.h> as an integer type.

I can't find a definition of "wide string" in the N3337 draft of the C++ standard, but it should be similar. One minor difference is that wchar_t is a typedef in C, and a built-in type (whose name is a keyword) in C++. But since C++ shares most of the C library, including functions that act on wide strings, it's safe to assume that the C and C++ definitions are compatible. (If someone can find something more concrete in the C++ standard, please comment or edit this paragraph.)

In both C and C++, the size of a wchar_t is implementation-defined. It's typically either 2 or 4 bytes (16 or 32 bits, unless you're on a very exotic system with bytes bigger than 8 bits). A wide string is a sequence of wide characters (wchar_t values), terminated by a null wide character. The terminating wide character will have the same size as any other wide character, typically either 2 or 4 bytes.

In particular, given that wchar_t is bigger than char, a single null byte does not terminate a wide string.
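A rough sketch of what this means in practice, assuming `wchar_t` is wider than `char` (the function names are hypothetical): a byte-oriented function such as `strlen` stops at the first zero byte it meets, while `wcslen` stops only at a null wide character.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Length a byte-oriented function would report if handed a wide string. */
size_t length_seen_as_bytes(const wchar_t *s)
{
    return strlen((const char *)s);  /* stops at the first zero byte */
}

/* Length in wide characters, stopping only at a null wide character. */
size_t length_seen_as_wide(const wchar_t *s)
{
    return wcslen(s);
}
```

For `L"HELLO"`, the wide length is 5, whereas the byte-wise view gives up after at most one byte, since the padding bytes of the first character already look like a terminator.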

It's also worth noting that byte order is implementation-defined. A wide character with the value 0x1234, when viewed as a sequence of 8-bit bytes, might appear as any of:

  • 0x12, 0x34
  • 0x34, 0x12
  • 0x00, 0x00, 0x12, 0x34
  • 0x34, 0x12, 0x00, 0x00

And those aren't the only possibilities.
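To see which of these layouts your own implementation uses, something like the following C sketch can help. `wide_char_bytes` is a hypothetical helper; it only copies the object representation of one `wchar_t` so the bytes can be inspected in memory order.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Copy the object representation of one wide character into out[].
   out must have room for sizeof(wchar_t) bytes; returns that size.
   (Hypothetical helper.) */
size_t wide_char_bytes(wchar_t wc, unsigned char *out)
{
    memcpy(out, &wc, sizeof wc);
    return sizeof wc;
}
```

Passing `(wchar_t)0x1234` and printing the resulting bytes in order shows both the size of `wchar_t` and the byte order on that system.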

Keith Thompson
1

If you declare

WCHAR tempWchar[BUFFER_SIZE];

you can fill it with null wide characters like this:

for (int i = 0; i < BUFFER_SIZE; i++)
            tempWchar[i] = L'\0';  /* L'\0', not NULL: NULL is a null pointer constant */
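As an alternative to the loop, the standard `wmemset` function from `<wchar.h>` can do the same job. This is only a sketch: it uses plain `wchar_t` instead of the Windows `WCHAR` typedef, `clear_wide_buffer` is a hypothetical wrapper name, and the `BUFFER_SIZE` value is an assumption because the original does not give one.

```c
#include <stddef.h>
#include <wchar.h>

#define BUFFER_SIZE 64  /* assumed size; the original does not specify it */

/* Fill buf with n null wide characters and return buf.
   (Hypothetical wrapper around the standard wmemset.) */
wchar_t *clear_wide_buffer(wchar_t *buf, size_t n)
{
    return wmemset(buf, L'\0', n);
}
```

If the buffer only needs to be empty at the point of declaration, `wchar_t tempWchar[BUFFER_SIZE] = {0};` zero-initializes every element without any loop.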
thang