15

What is the string terminator sequence for a UTF-16 string?

EDIT:

Let me rephrase the question in an attempt to clarify: how does the call to wcslen() work?

Deduplicator
Ray
  • +1 Regardless of the fact that it will solicit "wrong question" type responses, I love this type of question because it is just the thing that neither Google nor Wikipedia will tell you. – sehe May 07 '11 at 20:59
  • 2
    Probably because it's too obvious. :-) BTW, C does not allow UTF-16 as the encoding for `wchar_t`, and cannot simply because it doesn't work with the C API for wide characters, which assumes each multibyte character corresponds to a *single* `wchar_t` value, not a sequence of `wchar_t` values. You're stuck with either UCS-2 or standard functions that fail to obey the requirements of the standard if you insist on making `wchar_t` 16-bit... – R.. GitHub STOP HELPING ICE May 07 '11 at 21:42
  • On every system I’ve ever used, `sizeof(wchar_t)` == 4 bytes, or 32 bits. I didn’t think it would work otherwise. – tchrist May 07 '11 at 22:53
  • Microsoft Visual C++ has `sizeof(wchar_t) == 2`, much to the annoyance of programmers who need to write cross-platform libraries that support Unicode. – dan04 May 10 '11 at 02:57

3 Answers

17

Unicode does not define string terminators. Your environment or language does. For instance, C strings use 0x0 as a string terminator, and so do .NET strings internally, even though the String class also stores the length of the string in a separate value.

To answer your second question, wcslen looks for a terminating L'\0' character. As I read it, that is one wide character with the value zero, so its width in 0x00 bytes depends on the compiler's wchar_t; it will likely be the two-byte sequence 0x00 0x00 if you're using UTF-16 (encoding U+0000, 'NUL').
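To make the "looks for a terminating L'\0'" part concrete, here is a minimal sketch of the kind of loop wcslen could be built on. The name `sketch_wcslen` is made up for illustration, and real library implementations are usually more optimized; the point is only that the comparison happens one `wchar_t` at a time, never one byte at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wcslen(), for illustration only.
   It counts the wide characters that precede the first element
   equal to L'\0'. Each comparison looks at a whole wchar_t, so a
   character such as U+0100, which contains a zero byte, does not
   end the string. */
size_t sketch_wcslen(const wchar_t *s)
{
    assert(s != NULL);            /* precondition: a valid pointer */

    const wchar_t *p = s;
    while (*p != L'\0')           /* stop at the null wide character */
        ++p;

    return (size_t)(p - s);       /* number of elements skipped */
}
```

Because the loop steps in units of `wchar_t`, the same source works whether the compiler makes `wchar_t` 16 or 32 bits wide; only the number of zero bytes it ends up matching changes.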

ceztko
Michael Petrotta
  • 9
    Small point of trivia - .NET `String` objects have the length *and* a null terminator internally. That allows them to be used directly by interop functions which expect a terminator. – Jon Skeet May 07 '11 at 21:01
  • @Jon: had no idea, thanks. I assume I won't find that terminator if I go hunting `Chars`? – Michael Petrotta May 07 '11 at 21:03
  • ...which I don't have direct access to, of course, and `ToCharArray` could do whatever it wants, including likely some native magic. – Michael Petrotta May 07 '11 at 21:06
  • 1
    It seems that wcslen() must find, at minimum, two 0x00 bytes, since the character 'a' is UTF-16 encoded as the bytes 0x61 0x00 (little-endian). – Ray May 07 '11 at 21:31
  • 1
    @Ray: that's true, if your environment uses UTF-16. My point was that a wide character, as used by `wcslen`, doesn't have a defined length. You're free to use (a made-up) UTF-128, and then `wcslen` would be looking for a string of 16 `0x00` bytes. – Michael Petrotta May 07 '11 at 21:40
  • @MichaelPetrotta can you correct your affirmation about .NET null termination of strings? We can definitely trust Jon Skeet here, and I also confirm .NET runtime terminates strings for interop purposes. – ceztko Mar 01 '19 at 16:33
  • The wrong affirmation about missing null termination of strings in .NET has not been corrected yet. – ceztko Jun 27 '19 at 09:34
  • Corrected the answer after verifying the inactivity of the user. – ceztko Jun 27 '19 at 09:49
5

7.24.4.6.1 The wcslen function (from the Standard)

...

   [#3] The wcslen function returns the number of wide
   characters that precede the terminating null wide character.

And the null wide character is L'\0'
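As a rough check of what that paragraph implies, the sketch below measures the null wide character and the literal L"foo" on whatever implementation compiles it. The helper names `null_wide_char_bits` and `foo_element_count` are invented for this example; the bit width it reports is implementation-defined, so no particular value is assumed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: width in bits of the null wide character.
   Uses CHAR_BIT rather than a hard-coded 8. */
size_t null_wide_char_bits(void)
{
    wchar_t null = L'\0';
    assert(null == 0);            /* the terminator is the value zero */
    return CHAR_BIT * sizeof null;
}

/* Hypothetical helper: number of elements in the array L"foo",
   counting the terminating null wide character. */
size_t foo_element_count(void)
{
    return sizeof L"foo" / sizeof(wchar_t);
}
```

The array L"foo" has four elements while wcslen(L"foo") reports three, which is exactly the "precede the terminating null wide character" wording from the Standard.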

pmg
  • `wchar_t null = L'\0'; printf("null is %d bits\n", 8 * sizeof null);` prints out that null is 32 bits. – tchrist May 07 '11 at 22:55
  • @tchrist: you should be using `CHAR_BIT` instead of the magic 8. That `null` has the same size as each of the (4) elements of the array `L"foo"`. – pmg May 07 '11 at 22:58
4

There isn't any. String terminators are not part of an encoding.

For example, the string ab is encoded in UTF-16 (little-endian) as the byte sequence 61 00 62 00, and 大家 is encoded as 27 59 B6 5B. As you can see, there is no predetermined terminator sequence.
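The byte sequences above can be reproduced with a small sketch like the following. The function `utf16le_bytes` is a hypothetical helper written for this example, not a library call, and it assumes the input is already a sequence of UTF-16 code units that should be serialized little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serializes `count` UTF-16 code units as
   little-endian bytes into `out`, which must have room for
   2 * count bytes. Returns the number of bytes written. Nothing is
   appended after the last code unit: the encoding itself carries
   no terminator. */
size_t utf16le_bytes(const uint16_t *units, size_t count,
                     unsigned char *out)
{
    assert(units != NULL && out != NULL);

    for (size_t i = 0; i < count; ++i) {
        out[2 * i]     = (unsigned char)(units[i] & 0xFF);  /* low byte  */
        out[2 * i + 1] = (unsigned char)(units[i] >> 8);    /* high byte */
    }
    return 2 * count;
}
```

Passing the code units 0x0061 and 0x0062 yields four bytes and nothing more; any trailing 00 00 you see in a real buffer was put there by the language or API that owns the string.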

Darin Dimitrov