multicharacter literal misunderstanding

Question

I thought I knew how MSVC 2010 treated multicharacter literals, until this:

#include <stdio.h>

int main(int argc, char* argv[])
{
    int a = '\'   ';
    int b = '\'/  ';
    int c = '\'/> ';
    int d = '\'/>\x20';  // same as c supposedly
    int e = 'ABC\x20';
    printf("%X\n%X\n%X\n%X  <-- what?\n%X\n", a,b,c,d,e);
    return 0;
}

27202020
272F2020
272F3E20
20272F3E  <-- what?
41424320

In the IDE's watch window if you type:

'\'/>\x20'

It prints out:

272F3E20

Which is what I would expect. So... what's going on here!?

I found this on the net, so I'm thinking it's a compiler bug. I guess it might not get fixed because it could break older code?

EDIT: I'm pretty satisfied that this is a quirk or a bug that isn't going to change. It only seems to occur when there is more than 1 escape sequence being used in the multicharacter literal.

Here is a workaround:

('\'/>' << 8) | '\n'

Those are not integer literals; they are multicharacter literals. – chris Aug 19 '14 at 03:31
@merlin Sorry about the printf. I edited it to be more awesome in the question and screwed up. Fixing it right now. – johnnycrash Aug 19 '14 at 03:41
An obscure non-standard compiler feature no one uses has a bug? Say it ain't so! – n. m. could be an AI Aug 19 '14 at 03:59
You have likely come upon an endianness issue, which is probably inevitable when using such [obscure non-standard compiler-specific] techniques. – user3159253 Aug 19 '14 at 03:59
@user It's definitely not endianness. It seems to have something to do with using more than one escape sequence in the constant. – johnnycrash Aug 19 '14 at 04:01
Our application does a lot of small string manipulation. This trick increases locality and as such saves an entry in the TLB and a hit on memory in the data segment, which (god forbid) might not be in L2. – johnnycrash Aug 19 '14 at 04:08
@n.m I am allowed to say `*sz = '1'`, but not `*(int*)sz = '4321'`? – johnnycrash Aug 19 '14 at 04:31
`'4321'` is implementation-defined. See e.g. [here](http://stackoverflow.com/questions/6944730/multiple-characters-in-a-character-constant). TL;DR you are allowed to, but it doesn't mean you should. – n. m. could be an AI Aug 19 '14 at 05:26
If you want to encode several ASCII characters in an `int` constant, you can do that in a more portable way with shifts and bitwise-or, and avoid surprises when switching compilers. – n. m. could be an AI Aug 19 '14 at 05:32
@n.m. I appreciate the help, but in this case, portability is not an issue. 1) We have 4,000 blades all running the same OS; 2) GCC and MSVC implement this the same on Intel; 3) this is used within *portable* macros; 4) macros are more readable and less error-prone than hand coding a repetitive bit shifting pattern 100,000 times; 5) and most importantly, the performance gains are very real and measured. Everything was fine until I used two escape sequences in a single literal and it went berserk. – johnnycrash Aug 19 '14 at 17:01
There is nothing particularly repetitive in writing `#define CHAR32(a,b,c,d) (((a)<<24)|((b)<<16)|((c)<<8)|(d))` and then invoking it as needed. If you find yourself writing 100,000 literals of any kind you are probably doing something wrong anyway. – n. m. could be an AI Aug 19 '14 at 18:37
@n.m. Or perhaps the less wrong version: `#define CHAR32(a_) ((((a_)&0xFF)<<24)|(((a_)&0xFF00)<<8)|(((a_)&0xFF0000)>>8)|(((a_)&0xFF000000)>>24))` – johnnycrash Aug 19 '14 at 21:19

score 2 · Accepted Answer · edited May 23 '17 at 12:23

This appears to be a known MSVC compiler 'peculiarity'.

The C++ standard n3797 S2.14.3/1 says:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

So MSVC can certainly do this and claim it is 'implementation-defined' and not a bug.

If this were my call, I would probably say 'do not fix'. The risk of breaking existing code far outweighs any benefit of a fix, and the quirk is easily dealt with by an interesting question and answer on Stack Overflow.

Ref: see http://www.tech-archive.net/Archive/VC/microsoft.public.vc.language/2004-09/0079.html.

If you wish to reliably assemble equivalent values, you have two choices. Whether they produce the same result or byte-reversed results depends on endianness.

You can use arithmetic operations (shift and mask) to produce an integer value:

 '\'' | ('/' << 8) | ('>' << 16) | ('\x20' << 24)
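As a minimal sketch of the shift-and-or approach, the byte order can be made explicit with a `constexpr` function, so the value stays a compile-time constant. The helper names `to_u32`, `pack_be` and `pack_le` are hypothetical and not part of any library. `pack_be` puts the first character in the most significant byte, which is the layout MSVC printed for the single-escape literals in the question; `pack_le` mirrors the expression above.

```cpp
#include <cstdint>

// Hypothetical helper: widen one char to 32 bits without sign extension.
constexpr std::uint32_t to_u32(char c) {
    return static_cast<unsigned char>(c);
}

// First character ends up in the most significant byte.
constexpr std::uint32_t pack_be(char a, char b, char c, char d) {
    return (to_u32(a) << 24) | (to_u32(b) << 16) | (to_u32(c) << 8) | to_u32(d);
}

// First character ends up in the least significant byte.
constexpr std::uint32_t pack_le(char a, char b, char c, char d) {
    return pack_be(d, c, b, a);
}

// Checked at compile time; no multicharacter literal is involved.
static_assert(pack_be('\'', '/', '>', '\x20') == 0x272F3E20u, "big-end-first packing");
static_assert(pack_le('\'', '/', '>', '\x20') == 0x203E2F27u, "little-end-first packing");
static_assert(pack_be('A', 'B', 'C', '\x20') == 0x41424320u, "matches 'ABC\\x20' in the question");
```

Because each escape sequence sits in its own ordinary character literal, the two-escape case that misbehaves in the question does not arise.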

Or you can use string and cast operations to produce a string-like integer value:

*(int*)"\'/>\x20"

As a comment points out, depending on how it is written, this last technique can generate poor code: the string has to live somewhere at run time, and it will be null-terminated. Its main justification is that it can avoid the need for endian-sensitive #defines and preprocessing.
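One way to keep the string-based approach while avoiding the alignment and aliasing concerns raised in the comments is to copy the bytes with `std::memcpy` instead of casting the pointer. This is only a sketch under that assumption: `load4` and `is_little_endian` are hypothetical helper names, and the result still depends on the machine's byte order, exactly as the cast does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read the first four bytes of s in memory order.
// The caller must make sure s points at four or more readable bytes.
inline std::uint32_t load4(const char* s) {
    assert(s != nullptr);
    std::uint32_t v;
    std::memcpy(&v, s, sizeof v);  // no alignment or aliasing requirement on s
    return v;
}

// Reports whether this machine stores the low-order byte first.
inline bool is_little_endian() {
    const std::uint32_t one = 1;
    unsigned char first;
    std::memcpy(&first, &one, 1);
    return first == 1;
}
```

On a little-endian machine `load4("\'/>\x20")` yields 0x203E2F27, and on a big-endian machine it yields 0x272F3E20. Compilers typically reduce a fixed four-byte `memcpy` to a single load, but whether the string itself is folded into an immediate depends on the optimizer.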

See also this question: How to write a compile-time initialisation of a 4 byte character constant that is fully portable

answered Aug 19 '14 at 05:44

david.pfx

I think that is what is going on. There is some bug, or some intentional weirdness, with two escape characters in a multichar literal, and they are not going to fix it. The workaround is to break it into something like this: ('\'/>' << 8) | '\n'. – johnnycrash Aug 19 '14 at 17:04
I don't like the *(int*)"abcd" method, because sometimes the compiler puts "abcd" in the data segment and you get code like this: mov r11d,dword ptr [string "test" (13FF131D4h)]; mov dword ptr [rbp-39h],r11d. I want code like this: mov dword ptr [rbp-39h],74736574h. I have macros that do all the endianness flipping, because we use this technique to copy string values. So we want 'test' to work the same as "test". – johnnycrash Aug 22 '14 at 00:08
@johnnycrash: I understand your concern, but nevertheless there are cases where this technique can be used. One is when there is an intention to avoid the need for any endian-sensitive code at all. See edit. – david.pfx Aug 22 '14 at 00:16
Casting it to `int*` violates the aliasing rule and also invokes undefined behavior, because you can't be sure that it is aligned. – phuclv Nov 01 '17 at 03:54
@LưuVĩnhPhúc: Creating 4 byte literals from strings or characters was always implementation-defined and thus intrinsically non-portable. Once you write code targeting a specific compiler and processor, alignment is just one of many considerations. Aliasing is not an issue here. – david.pfx Nov 02 '17 at 04:44
MSVC *could* do this and claim it’s “implementation-defined”, as long as it were actually *defined* somewhere in the MSVC documentation (that’s what the standard means by that phrase—it’s a license to do whatever only as long as you document that whatever). As best as I can tell, it isn’t. – Alex Shpilkin Apr 10 '23 at 20:38
