
I have a Unicode string and I need to know the number of bytes it uses.

In general, I know that wcslen(s) * 2 will work. But my understanding is that this is not reliable when working with Unicode.

Is there a reliable and performant way to get the number of bytes used by a Unicode string?

asked by Jonathan Wood; edited by πάντα ῥεῖ
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/232150/discussion-on-question-by-jonathan-wood-getting-the-number-of-bytes-in-a-unicode). – Machavity May 09 '21 at 20:51

1 Answer


wcslen counts the number of wchar_t entities, until it finds a NUL character. It doesn't interpret the data in any way.

(wcslen(s) + 1) * sizeof(wchar_t) will always, reliably calculate the number of bytes required to store the string s.

answered by IInspectable; edited by Mark Tolonen
  • 1
    It might be worth noting that the `2` is correct for Windows, where `wchar_t` is UTF-16, but not on other platforms where it is UTF-32 (and thus 4 byes). – Quentin May 09 '21 at 19:35
  • @que Much as I like accuracy, this is just adding noise. The question is properly tagged [tag:winapi]. Its ABI mandates `wchar_t` to be 2 bytes. – IInspectable May 09 '21 at 19:37
  • @IInspectable Guys, the answer seems to report the wrong length for UCS-4 strings. See https://onlinegdb.com/HyslATB_d. Please comment. Regards. It shows that "foo" is 8 symbols long. – bimjhi May 09 '21 at 21:03
  • The algorithm used on onlinegdb, which converts from UTF-16 to UCS-4, is here: https://stackoverflow.com/questions/8540090/clang-converting-const-char16-t-utf-16-to-wstring-ucs-4 – bimjhi May 09 '21 at 21:13
  • @bimjhi - your comment is wrong - the question is trivial - simply find the 0 symbol and count how many bytes there are from the start of the string to that 0. How is encoding related here at all? – RbMm May 09 '21 at 21:27
  • @RbMm For some reason, if I convert a wstring to UTF-16 and then to UCS-4, `(wcslen(s) + 1) * sizeof(wchar_t)` returns the wrong length. To give an example, it returned 8 for the "foo" string, which I reproduced on onlinegdb and in my local Visual Studio on Windows. Regards. – bimjhi May 09 '21 at 21:30
  • @bimjhi - *if I convert* - why are you converting at all? The question is simply about the length in bytes until the 0. – RbMm May 09 '21 at 21:32
  • Try taking the length in bytes of a UCS-4 string yourself using `(wcslen(s) + 1) * sizeof(wchar_t)`. This is what I'm talking about. – bimjhi May 09 '21 at 21:36
  • I just mean that UCS-4 is a part of Unicode standard, nothing more. – bimjhi May 09 '21 at 21:39
  • @bimjhi - you don't understand the question at all; it is about the size in *bytes* up to the 0 character, while you keep talking about *length*. – RbMm May 10 '21 at 06:52
  • 1
    @bim This question is the result of confusing character encoding and storage. C doesn't know **anything** about Unicode, and deals with the storage aspects only. On Windows, a `wchar_t` is 2 bytes, and `wcslen` reports the number of **code units** up to the first zero terminator of any given string. It expressly doesn't deal with any encoding aspects, and doesn't report the number of **code points**, which the question seems to imply. If you are talking about UTF-16, you are referring to something this Q&A doesn't apply to. – IInspectable May 10 '21 at 07:26
  • @bimjhi `u16string(wstr.begin(), wstr.end())` is not the correct way to convert a `wstring` to a `u16string`, especially on systems where `sizeof(wchar_t) > sizeof(char16_t)`. But in any case, since you are dealing with `wstring`, you could replace `wcslen(ws.c_str())` with `ws.size()` and get the same result. 8 is the correct **byte size** for wstring `L"foo"` when `sizeof(wchar_t) == 2` (Windows) AND you +1 the **character length** to include the null terminator. `L"foo"` on Windows is 8 bytes `66 00 6F 00 6F 00 00 00`, you can verify that with `sizeof(L"foo")` (ie `(3+1)*sizeof(wchar_t)`) – Remy Lebeau May 12 '21 at 23:08
  • @bimjhi same with a properly created `u16string` (consider using `std::u16string u16str = u"foo";` instead), as `(u16str.size() + 1) * sizeof(char16_t)` is 8 for `u"foo"` on all platforms (`(3+1)*sizeof(char16_t)`). `wstring` and `u16string` are both used for UTF-16 strings on Windows. – Remy Lebeau May 12 '21 at 23:15
  • @rem That's not strictly correct. [char16_t](https://en.cppreference.com/w/c/string/multibyte/char16_t) is *"at least 16 bits"*. Yet another one of the many missed opportunities of introducing Unicode properly. – IInspectable May 13 '21 at 02:39