
I don't understand whether, in C, every string is always a multibyte string, meaning that it is encoded as multibyte characters:

char s[] = "AAA"; 

char m[] = "X生";

Is s also a multibyte string even though, unlike m, it doesn't contain a member of an extended character set?

I ask because I read this in the libc manual:

“string” normally refers to multibyte character strings as opposed to wide character strings. Wide character strings are arrays of type wchar_t and as for multibyte character strings usually pointers of type wchar_t * are used.

So I don't understand whether "multibyte" refers to the number of bytes in the string or to its encoding, as opposed to that of a wide character string.

xdevel2000
  • `s` does not include multi-byte chars there (only ASCII chars which are each a single byte wide). But the string itself is four bytes long. Are you asking if all C strings are multiple bytes long, or if all chars are? – HalosGhost Jan 26 '15 at 12:49
  • But encoded how? There are many ways to encode extended chars. – m0skit0 Jan 26 '15 at 12:50
  • I edited the question. I hope it's clearer. – xdevel2000 Jan 26 '15 at 12:56
  • The relevant keywords are translation character set and execution character set. This [SO question](http://stackoverflow.com/questions/27872517/what-are-the-different-character-sets-used-for) seems to cover this topic properly. – Nick Zavaritsky Jan 26 '15 at 13:07

2 Answers

The C99 draft standard (C11 looks the same) defines a multibyte character as follows:

sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment

So a multibyte character represents a member of the extended character set. s contains only basic-set characters, each encoded as a single byte, so none of its characters needs more than one byte.

Multibyte characters are further defined in section 5.2.1.2:

The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

  • The basic character set shall be present and each character shall be encoded as a single byte.

  • The presence, meaning, and representation of any additional members is locale-specific.

  • A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

  • A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

Shafik Yaghmour

You can easily test how many bytes a string takes. On my machine, I compiled the following code:

#include <stdio.h>

int main(void) {
    char s[] = "AAA";
    char m[] = "X生";
    printf("s: %zu\n", sizeof(s)); /* sizeof yields a size_t, so use %zu */
    printf("m: %zu\n", sizeof(m));
}

I get the following output:

s: 4
m: 5

That means s contains no character that takes more than one byte, but m does. Both sizes include the terminating null byte. To make sure your compiler and system behave the same way, run the test on your own system.

Lucas
  • It's the same on my system. However, what I'd like to know is whether, in C terminology, both strings are always multibyte strings, regardless of whether they contain a character outside the basic character set. – xdevel2000 Jan 26 '15 at 13:04