This source code is switching on a string in C. How does it do that?

Question

I'm reading through some emulator code and I've encountered something truly odd:

switch (reg){
    case 'eax':
    /* and so on*/
}

How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?

It is not a string: `'eax'` denotes a constant integer value — 0___________, Aug 07 '17 at 15:35
Single quotes, not double. A character constant is promoted to `int`, so it’s legal. However, the value of a multi-character constant is implementation-defined, so the code might not work as expected on another compiler. For example, `eax` might be `0x65`, `0x656178`, `0x65617800`, `0x786165`, `0x6165`, or something else. — Davislor, Aug 08 '17 at 02:11
@Davislor: given the name of the variable "reg", and the fact that eax is an x86 register, I would guess that the implementation-defined behaviour was intended to be OK, because it's the same everywhere it's used in the code. Just as long as `'eax' != 'ebx'`, of course, so it only fails one or two of your examples. Although there might be some code somewhere that in effect assumes `*(int*)("eax") == 'eax'`, and therefore fails most of your examples. — Steve Jessop, Aug 08 '17 at 13:21
@SteveJessop I don’t disagree with what you say, but there is the real danger that someone could try to compile the code on a different compiler, even for the same architecture, and get different behavior. For example, `'eax'` might compare equal to `'ebx'` or to `'ax'`, and the switch statement would not work as intended. — Davislor, Aug 08 '17 at 21:50
All of that mystery would have quickly been dispelled if you had looked up/shown us the data type of reg. — ths, Aug 08 '17 at 22:43
By the way, I would tend to consider this code stinky. Why didn't the original designer just define an enumerated constant `reg_eax` with a nice value, like zero? `switch` statements encompassing sets of non-consecutive, large values do not compile into nice jump tables. — Kaz, Aug 09 '17 at 19:10

Bathsheba · Accepted Answer · 2017-08-07T15:40:28.123

147

(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)

In order to achieve program readability, the witty developer is exploiting implementation-defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32-bit int.) And everyone knows you can switch on an int!

Finally, a standard reference:

The C99 standard says:

6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."

edited Aug 07 '17 at 15:40

answered Aug 07 '17 at 15:33

Bathsheba


55

Just in case anyone sees that and panics, "implementation-defined" is required to work and to be documented by your compiler in some appropriate fashion (the standard doesn't require that the behaviour be intuitive or that the documentation be any good, but...). This is "safe" to use for a coder who fully understands what they're writing, as opposed to "undefined". – Alex Celeste Aug 07 '17 at 18:25
4

Just a note: the original intent was for multibyte characters like Unicode. One UTF8 "character" on screen may be up to four bytes. – Zan Lynx Aug 07 '17 at 18:39
Seems to me like a conforming implementation could do something weird like define all multi-character constants to be equal, so just blindly using them doesn't seem like a good idea. Should definitely consult your implementation's documentation before making any assumptions about them – Justin Aug 07 '17 at 21:25
7

@Justin While it could, that would be quite perverse. If it doesn't do what the answer suggests is most likely, the next possibility is probably that it just uses the first character and ignores the rest. – Barmar Aug 07 '17 at 21:54
1

people who write compilers are not (in general) insane, so they try hard to map implementation-defined constructs in a useful *and logical* way, and to issue warnings when they can't. Because... engineering. -ex compiler guy. – Spike0xff Aug 08 '17 at 00:33
3

@Barmar Having officially "undefined" behavior where the compiler is allowed to make demons fly out of your nose, to the point that you have to *constantly* guard against accidentally invoking it, is pretty perverse to begin with... – jpmc26 Aug 08 '17 at 01:29
@Barmar Your "next possibility" reinforces my main point: before making any assumptions on what multibyte constants mean, you should consult your implementation's documentation – Justin Aug 08 '17 at 02:25
5

@ZanLynx I'm not positive, but I believe the feature long predates Unicode and other MBCS standards. "Magic numbers" that look like text in memory dumps and RIFF-style file-format-chunk IDs were the first applications I'm aware of. – Russell Borogove Aug 08 '17 at 03:55
16

@jpmc26 This is not undefined behavior, it's implementation-defined. So unless the compiler documentation mentions demons, your nose is safe. – Barmar Aug 08 '17 at 04:45
1

@jpmc26: Undefined behavior usually lurks in places you have to guard against anyways, so that's not really a fair description. – Aug 08 '17 at 04:56
7

@ZanLynx: I'm afraid the original intent predates Unicode, UTF-8 and any multibyte character encoding by almost 20 years. *Multi-character constants* were just a handy way to express integers representing groups of 2, 3 or 4 bytes (depending on the byte and int sizes). Inconsistencies across implementations and architectures led the committee to declare this as *implementation-defined*, which means there is no portable way to compute the value of `'ab'` from `'a'` and `'b'`. – chqrlie Aug 08 '17 at 06:41
@chqrlie: I am pretty sure that you're wrong about that. It was always about encoding character sets for languages other than English ASCII. The ability to make an integer of 4 bytes is just happy coincidence.. – Zan Lynx Aug 08 '17 at 17:37
@ZanLynx: here is an interesting page on multi-chars: http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html and looking back further away into the rearview mirror, here is the original C manual from the good old days: https://www.bell-labs.com/usr/dmr/www/cman.pdf . – chqrlie Aug 08 '17 at 19:05
1

@ZanLynx: ***2.3.2 character constants** [...] Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.* – chqrlie Aug 08 '17 at 19:06
@ZanLynx: The ability to interpret a sequence of four characters to a 32-bit unsigned value was supported by Macintosh C compilers, probably going back to their 1986 (since it had previously been supported by Macintosh Pascal compilers). Maybe some machines somewhere used multi-byte characters, but they certainly weren't common. – supercat Aug 09 '17 at 00:18
@Barmar I noticed that. I was saying the language is already perverse, so one more perverse behavior wouldn't change that much. – jpmc26 Aug 09 '17 at 03:21
Multi-character constants have nothing to do with Unicode. They simply reflect the idea that multiple characters can be packed into a word, and this is useful for creating constants which spell something. – Kaz Aug 09 '17 at 04:31
2

Apple added that most likely because their platform was chock full of "fourcc" codes for identifying data types and formats. For instance, files types. Fourcc codes were often chosen which spelled something when interpreted as ASCII bytes. – Kaz Aug 09 '17 at 04:34
Since multibyte character literals are in exactly the same place in the standard as wide characters, both are for non-ASCII language support. Otherwise `'♂'` would look funny. – Zan Lynx Aug 09 '17 at 10:42
@Kaz: Yup. Software could copy and compare such things using a single operation on a "LongInt" [Pascal] or "long" [C], but tools to manipulate things like file, application, or resource types could show them in human-readable format. File, application, or resource types with fewer than four characters were blank-padded (so a "snd " resource would have a type of 0x736e6420), and the first character of every type that was publicly used was less than 0x80, so the lack of an unsigned 32-bit type in Pascal didn't cause any weirdness with those types. – supercat Aug 09 '17 at 15:36
1

@supercat Of course, since those character constants are not portable, what you do in portable C is `FOURCC('m', 'o', 'o', 'v')` via a macro. It's less cumbersome to just be able to use `'moov'`. – Kaz Aug 09 '17 at 16:48
1

*"Since multibyte character literals are in exactly the same place in the standard as wide characters, both are for non-ASCII language support."* Does not logically follow. They are in the same place because they are syntactically related; they are character constants. – Kaz Aug 09 '17 at 16:49
1

@Kaz: In the days when people writing compilers for a platform could be expected to honor that platform's idioms, portability of such constructs was a non-issue. If one was using the Macintosh Resource Manager, File Manager, or Desktop Manager functions, that meant one was writing code for the Macintosh and would thus be using a compiler designed for that platform. – supercat Aug 09 '17 at 18:56
Just wondering, did you use a fake account to ask this question? – Casanova Aug 14 '17 at 23:52

Vlad from Moscow · Answer 2 · 2017-08-07T16:00:29.990

According to the C Standard (6.8.4.2 The switch statement)

3 The expression of each case label shall be an integer constant expression...

and (6.6 Constant expressions)

6 An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.

Now what is 'eax'?

The C Standard (6.4.4.4 Character constants)

2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'...

So 'eax' is an integer character constant according to the paragraph 10 of the same section

...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.

So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.

Note that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes), which has a character array type.

Stig Hemmer · Answer 3 · 2017-08-09T07:16:10.153

As others have said, this is an int constant and its actual value is implementation-defined.

I assume the rest of the code looks something like

if (SOMETHING)
    reg='eax';
...
switch (reg){
    case 'eax':
    /* and so on*/
}

You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.

In a comment @Davislor lists some possible values for 'eax':

... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else

Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx', and so on. If all these constants have the same value as 'e' you end up with

switch (reg){
    case 'e':
       ...
    case 'e':
       ...
    ...
}

This doesn't look too good, does it?

The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.

The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.

As @zwol pointed out in the comments, the situation is not quite as bad as I thought, in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.

Other than some form of `assert('eax' != 'ebx'); //if this fails you can't compile the code because...`, is there anything the original author could do to prevent other compiler failures without replacing the construct entirely? — Dan Is Fiddling By Firelight, Aug 08 '17 at 14:01
Two case labels with the same value are a constraint violation (6.8.4.2p3: "...no two of the case constant expressions in the same switch statement shall have the same value after conversion") so, as long as all the code treats the values of these constants as opaque, this is guaranteed either to work or to fail to compile. — zwol, Aug 08 '17 at 17:33
The worse part is that the poor fellow compiling on another compiler probably will not see any *compile-time* error (switching on ints is fine); instead, *run-time* errors will crop up... — tucuxi, Aug 09 '17 at 12:01

score 2 · Answer 4 · answered Aug 08 '17 at 19:19

The code fragment uses a historical oddity called a multi-character character constant, also referred to as a multi-char.

'eax' is an integer constant whose value is implementation-defined.

Here is an interesting page on multi-chars and how they can be used but should not:

http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html

Looking further back in the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.

2.3.2 Character constants

A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
    BS \b
    NL \n
    CR \r
    HT \t
    ddd \ddd
    \ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.

Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.

The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.
