
Below is an excerpt from an old edition of the book Programming Windows by Charles Petzold

> There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide-character run-time library are larger than the usual functions.

Why would every string in my program occupy twice the bytes? Shouldn't only the character arrays we've declared as `wchar_t` do so?

Is there perhaps some rule that, if a program is to be able to work with long values, the mode the entire program operates in is altered?

Usually, when we declare a long int, we never fuss over it or say that all ints will now occupy double the memory. Are strings somehow a special case?
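
To put the analogy in code, here is a minimal sketch; the sizes in the comments are only typical values, since the C standard fixes minimum ranges rather than exact widths:

```c
#include <stdio.h>

int main(void)
{
    int  i = 1;   /* an ordinary int                                          */
    long l = 1L;  /* declaring a long does not change the size of other ints  */

    printf("sizeof i = %zu\n", sizeof i);  /* typically 4                           */
    printf("sizeof l = %zu\n", sizeof l);  /* 4 on Windows, often 8 on 64-bit Unix  */
    return 0;
}
```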

mindoverflow
    I believe the assumption is with regard to Windows API functions. The W versions of those functions that take string parameters will require wide strings. That might be 100% of the strings or 1%, it depends on the program. – Retired Ninja Dec 22 '21 at 06:45
  • `wchar_t`'s size isn't fixed, just like other types in C, but typically it's 2 bytes on Windows and 4 bytes on Unix – phuclv Dec 22 '21 at 08:40
  • @phuclv you've now brought up something interesting here. I've always known a normal char represents 1 byte, a wide one will represent 2. in x86 an int is 4 bytes, and long is 8 bytes, long long would be 16; just like that, perhaps, UNIX takes into account storage all the way up to UTF-32 instead of UTF-8? – mindoverflow Dec 22 '21 at 14:23
  • @RetiredNinja yes, that's what I believe. As long as it doesn't bring down an MRI machine, I think I could live with using wide chars. I believe most systems nowadays would not make an issue out of this – mindoverflow Dec 22 '21 at 14:30
  • @mindoverflow `char` in C has at least 8 bits, so a `char` can represent all Unicode characters on a platform with 32-bit `char`. Windows is just an early adopter of Unicode, before the era of UCS-4, so it uses UCS-2, which is now a subset of UTF-16. Unix came much later to the Unicode world and uses UTF-8 for narrow strings for backward compatibility, but I have no idea why they chose UTF-32 for wide strings – phuclv Dec 22 '21 at 14:51
  • `in x86 an int is 4 bytes, and long is 8 bytes, long long would be 16;` this isn't correct. Size depends on compiler implementation and has nothing to do with the hardware architecture, for example on 64-bit Windows `long` is a 32-bit type – phuclv Dec 22 '21 at 14:52
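
phuclv's point about the width of `wchar_t` can be checked directly; a minimal sketch, where the values in the comments are typical rather than guaranteed by the C standard:

```c
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* The C standard leaves the width of wchar_t up to the implementation. */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));  /* typically 2 on Windows, 4 on Linux/macOS */
    printf("sizeof(L\"hi\")   = %zu\n", sizeof(L"hi"));    /* 3 * sizeof(wchar_t), counting the NUL    */
    return 0;
}
```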

3 Answers


If a string could potentially contain a character outside of the ascii range, you'll have to declare it as a wide string. So most strings in the program will be bigger. Personally, I wouldn't worry about it; if you need Unicode, you need Unicode, and a few more bytes aren't going to kill you.

That seems to be what you're saying, and I agree. But the question is skating the fine line between opinionated and objective.

rici
  • `If a string could potentially contain a character outside of the ascii range, you'll have to declare it as a wide string` this is simply not true. You can store any Unicode characters with UTF-8 in a normal string. And nowadays even [Windows supports the UTF-8 locale](https://stackoverflow.com/a/63454192/995714) – phuclv Dec 22 '21 at 08:39
  • @phuclv logically, how would you go about explaining that? There just aren't enough bytes in a 128-byte representation to hold Unicode values. my question _refers_ to Windows programming, but the main concern is how C is handling it, not the Windows API. – mindoverflow Dec 22 '21 at 14:18
  • @mindoverflow C doesn't have Unicode support apart from some `wchar_t` functions, which are far from complete Unicode support. `There just aren't enough bytes in a 128-byte representation to hold Unicode values` this doesn't make sense. ASCII is a 7-bit charset that can represent 128 different values, not 128 bytes. I don't understand what you mean. In short, UTF-8 is a variable-length encoding where a code point can be represented by 1 to 4 bytes – phuclv Dec 22 '21 at 14:47
  • @mindoverflow: UTF-8 is a variable-length multibyte encoding. Any Unicode character can be represented, using from one to four bytes, with the one-byte codes corresponding to the 128 ASCII codes. C has minimal support for UTF-8; anything interesting, such as determining how long a code sequence is, has to be done explicitly or with 3rd-party libraries. There's no primitive datatype in C which represents a single UTF-8 sequence; normally, the sequences are converted to a single integer. – rici Dec 22 '21 at 14:47
  • None of that is intended as a criticism. I strongly recommend the use of UTF-8 for data interchange. There are excellent unicode libraries. But you need to have appropriate expectations :-). The issue with Windows is using Unicode strings in system APIs, such as filepaths. There, you need to consult the Windows documentation. See the link in phuclv's first comment above. – rici Dec 22 '21 at 14:54
  • Since C11, C provides `char16_t` and `char32_t`, as well as string literals which produce UTF-8 strings (as `char*`), UTF-16 strings (as `char16_t*`) and UTF-32 strings (as `char32_t*`), as well as standard functions for converting strings from one representation to another. See `uchar.h`, if it is present in your C library. – rici Dec 22 '21 at 15:08
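
A minimal sketch of the literal prefixes described in the last comment, assuming a C11 (or newer) compiler that ships `<uchar.h>`:

```c
#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t and conversion functions such as mbrtoc16 (C11) */

int main(void)
{
    /* U+20AC (the euro sign) lies outside ASCII, so its storage cost
       differs between the three encodings.                            */
    printf("UTF-8  literal: %zu bytes\n", sizeof(u8"\u20AC"));  /* 3 bytes + NUL    = 4 */
    printf("UTF-16 literal: %zu bytes\n", sizeof(u"\u20AC"));   /* 1 char16_t + NUL = 4 */
    printf("UTF-32 literal: %zu bytes\n", sizeof(U"\u20AC"));   /* 1 char32_t + NUL = 8 */
    return 0;
}
```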

> Why would every string in my program occupy twice the bytes? Shouldn't only the character arrays we've declared as `wchar_t` do so?

As I understand it, what is meant is that if you have a program that uses `char *`, and you rewrite that program to use `wchar_t *`, then it will use (more than) twice the bytes for its strings.
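
A minimal sketch of that rewrite, comparing the same five-character string stored as `char` and as `wchar_t`; the wide size depends on `sizeof(wchar_t)`, which is 2 on Windows and commonly 4 on Unix-like systems:

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    const char    narrow[] = "hello";   /* 1 byte per character                */
    const wchar_t wide[]   = L"hello";  /* sizeof(wchar_t) bytes per character */

    /* Both strings contain the same five characters... */
    printf("strlen(narrow) = %zu\n", strlen(narrow));       /* 5 */
    printf("wcslen(wide)   = %zu\n", wcslen(wide));         /* 5 */

    /* ...but only the one rewritten with wchar_t costs more storage. */
    printf("sizeof narrow  = %zu bytes\n", sizeof narrow);  /* 6                                       */
    printf("sizeof wide    = %zu bytes\n", sizeof wide);    /* 6 * sizeof(wchar_t), e.g. 12 on Windows */
    return 0;
}
```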

KamilCuk
  • Yes, I'm also thinking that. We're on the same page, right? If I don't declare a string to be wide, then I am not using the memory allocation of wide, correct? – mindoverflow Dec 22 '21 at 14:20
  • `If I don't declare a string to be wide, then I am not using the memory allocation of wide, correct?` Yes. – KamilCuk Dec 22 '21 at 14:56

Unicode has several encodings: UTF-8, UTF-16, and UTF-32 (https://en.wikipedia.org/wiki/Unicode). You can check their advantages and disadvantages to know which one you should use in which situation.

reference: UTF-8, UTF-16, and UTF-32

long.kl