
Consider the following code:

#include <stdio.h>

int main()
{
    char A = A ? 0[&A] & !A : A^A;
    putchar(A);
}

I'd like to ask whether this code exhibits any undefined behaviour.

Edit

Please note: the code intentionally uses 0[&A] & !A and NOT A & !A (see response below)

End edit

Taking the output ASM from g++ 6.3 (https://godbolt.org/g/4db6uO) we get (no optimizations were used):

main:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 16
    mov     BYTE PTR [rbp-1], 0
    movzx   eax, BYTE PTR [rbp-1]
    movsx   eax, al
    mov     edi, eax
    call    putchar
    mov     eax, 0
    leave
    ret

However clang gives a lot more code for the same thing (no optimizations again):

main:                                   # @main
    push    rbp
    mov     rbp, rsp
    sub     rsp, 16
    mov     dword ptr [rbp - 4], 0
    cmp     byte ptr [rbp - 5], 0
    je      .LBB0_2
    movsx   eax, byte ptr [rbp - 5]
    cmp     byte ptr [rbp - 5], 0
    setne   cl
    xor     cl, -1
    and     cl, 1
    movzx   edx, cl
    and     eax, edx
    mov     dword ptr [rbp - 12], eax # 4-byte Spill
    jmp     .LBB0_3
.LBB0_2:
    movsx   eax, byte ptr [rbp - 5]
    movsx   ecx, byte ptr [rbp - 5]
    xor     eax, ecx
    mov     dword ptr [rbp - 12], eax # 4-byte Spill
.LBB0_3:
    mov     eax, dword ptr [rbp - 12] # 4-byte Reload
    mov     cl, al
    mov     byte ptr [rbp - 5], cl
    movsx   edi, byte ptr [rbp - 5]
    call    putchar
    mov     edi, dword ptr [rbp - 4]
    mov     dword ptr [rbp - 16], eax # 4-byte Spill
    mov     eax, edi
    add     rsp, 16
    pop     rbp
    ret

And Microsoft VC compiler gives:

EXTRN   _putchar:PROC
tv76 = -12                                          ; size = 4
tv69 = -8                                         ; size = 4
_A$ = -1                                                ; size = 1
_main   PROC
    push     ebp
    mov      ebp, esp
    sub      esp, 12              ; 0000000cH
    movsx    eax, BYTE PTR _A$[ebp]
    test     eax, eax
    je       SHORT $LN5@main
    movsx    ecx, BYTE PTR _A$[ebp]
    test     ecx, ecx
    jne      SHORT $LN3@main
    mov      DWORD PTR tv69[ebp], 1
    jmp      SHORT $LN4@main
$LN3@main:
    mov      DWORD PTR tv69[ebp], 0
$LN4@main:
    mov      edx, 1
    imul     eax, edx, 0
    movsx    ecx, BYTE PTR _A$[ebp+eax]
    and      ecx, DWORD PTR tv69[ebp]
    mov      DWORD PTR tv76[ebp], ecx
    jmp      SHORT $LN6@main
$LN5@main:
    movsx    edx, BYTE PTR _A$[ebp]
    movsx    eax, BYTE PTR _A$[ebp]
    xor      edx, eax
    mov      DWORD PTR tv76[ebp], edx
$LN6@main:
    mov      cl, BYTE PTR tv76[ebp]
    mov      BYTE PTR _A$[ebp], cl
    movsx    edx, BYTE PTR _A$[ebp]
    push     edx
    call     _putchar
    add      esp, 4
    xor      eax, eax
    mov      esp, ebp
    pop      ebp
    ret      0
_main   ENDP

But with optimizations we get much cleaner code (gcc and clang):

main:                                   # @main
    push    rax
    mov     rsi, qword ptr [rip + stdout]
    xor     edi, edi
    call    _IO_putc
    xor     eax, eax
    pop     rcx
    ret

And somewhat mysterious VC code (it seems the VC compiler can't take a joke ... it just does not precalculate the right-hand side):

EXTRN   _putchar:PROC
_A$ = -1                                                ; size = 1
_main   PROC                                      ; COMDAT
    push     ecx
    mov      cl, BYTE PTR _A$[esp+4]
    test     cl, cl
    je       SHORT $LN3@main
    mov      al, cl
    xor      al, 1
    and      cl, al
    jmp      SHORT $LN4@main
$LN3@main:
    xor      cl, cl
$LN4@main:
    movsx    eax, cl
    push     eax
    call     _putchar
    xor      eax, eax
    pop      ecx
    pop      ecx
    ret      0
_main   ENDP

Some Warnings:

  1. You should not write code like this. This is definitely bad coding style and should never go into a serious application. It is just for fun.

Some Explanations:

  1. I look for undefined behaviour since the value of A is used in its initialization. Again: You should not do this.
  2. However the way the expression is built up, both parts of the code will yield 0, as the compilers

So my dilemma right now is whether this is UB or not.

2501
Ferenc Deak
  • 7
    Undefined behavior has *nothing* to do with what the compiler does. If you want to know if something is undefined, read the standard. If it says undefined, it's undefined. – nvoigt Feb 20 '17 at 13:46
  • 8
    Yes it is undefined behavior. You are using the value of `A` to initialize itself. What any particular compiler chose to do with this code does not make it defined behavior. The C or C++ standards are what describe defined behavior. – Cory Kramer Feb 20 '17 at 13:46
  • 2
    Your code can be boiled down to `char A = A ? 0 : 0;` which is still UB (using an uninitialized variable). I can't see how it would fail but UB is UB. – NathanOliver Feb 20 '17 at 13:49
  • I think it might be defined because you're using a character type to access the uninitialized local variable. (http://port70.net/~nsz/c/c11/n1570.html#6.2.6.1p5) – Petr Skocik Feb 20 '17 at 13:56
  • Strictly speaking, you are reading the value of `A` before it is initialized. The value in `A ? ... : ...` is unspecified, and reading it is UB if memory serves. – StoryTeller - Unslander Monica Feb 20 '17 at 13:59
  • [dcl.init] /12 (http://eel.is/c++draft/dcl.init#12) is probably relevant. And the fact that the type is the narrow character type might be significant. – eerorika Feb 20 '17 at 14:01
  • DR451 says the putchar should have undefined behaviour, because the value of A after the expression is still indeterminate. – Antti Haapala -- Слава Україні Feb 20 '17 at 14:22
  • You posted C code, compiled it with g++, and have added both [tag:c] and [tag:c++] tags. In case it does make a difference, which language tag is the correct one? – IInspectable Feb 20 '17 at 14:39
  • This looks very opinion based to me - `A` is initialized with `0` in both cases, just the way how the program calculates `0` is different. In this case, the random value of `A` has no influence to the result of the expression. – Aemyl Feb 20 '17 at 14:47
  • Tempted to close because it is unclear what you mean by "undefined behavior is observed". You can't observe the undefined, right? Also, the question is marked as C and C++. You should choose one, because they have different behavior. – Johannes Schaub - litb Feb 20 '17 at 14:58
  • @Aemyl there shouldn't be anything opinion-based about the standards, either they define something, or they don't. – Antti Haapala -- Слава Україні Feb 20 '17 at 15:09
  • @Antii Haapala in my opinion the question is more related to logic than to the standards. Sorry if I am wrong about this ... – Aemyl Feb 20 '17 at 15:13
  • 2
    @Aemyl: Logic alone cannot answer a question about language rules. Those rules are arbitrary. And if you read [this proposed answer](http://stackoverflow.com/a/42346432/1889329), you'll see that `A` can have any value. As far as the language goes, its value is indeterminate, even though logic wants you to believe that the result of the expression is always `0`. – IInspectable Feb 20 '17 at 15:47
  • Logic: if the precondition is false you can derive everything even truth. If your program is UB then the result can be a "valid" program. Assembly will not help you. – knivil Feb 20 '17 at 16:23
  • So actually the whole question is about if `0[&A]` is synonym to `A` in every case related to the standard? In clearer words? – Aemyl Feb 20 '17 at 22:06
  • @IInspectable: I could see a rationale for allowing something like `(257+0[&a]-0[&a]) >> 4` to yield any value in the range 0-32 (an implementation could perform two reads of `0[&a]`, at different times, and the underlying storage might have different values when the two reads are performed). I don't really see much value in clang's approach, but I'm not sure how to ask what the point of clang's approach is without seeming argumentative. – supercat Feb 20 '17 at 22:50
  • @Aemyl no, it is not about that; and `A` is not equal to `0[&A]`, because with latter it means that the address of `A` has now been taken. – Antti Haapala -- Слава Україні Feb 21 '17 at 07:57

3 Answers

7

First of all, if char corresponds to unsigned char, a char cannot have a trap representation; however if char corresponds to signed char it can have trap representations. Since using a trap representation has undefined behaviour, it is more interesting to modify the code to use unsigned char:

unsigned char A = A ? 0[&A] & !A : A^A;
putchar(A);

Initially I believed that there isn't any undefined behaviour in C. The question is whether A is uninitialized in a manner that has undefined behaviour, and the answer is "no": although it is a local variable with automatic storage duration, its address is taken, so it must reside in memory, and its type is a character type, so its value is unspecified but cannot be a trap representation.

The C11 Annex J.2 specifies that the following has undefined behaviour:

An lvalue designating an object of automatic storage duration that could have been declared with the register storage class is used in a context that requires the value of the designated object, but the object is uninitialized. (6.3.2.1).

with 6.3.2.1p2 saying that

If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.

Since the address of A is taken, it could not have been declared with the register storage class, and therefore its use does not have undefined behaviour under 6.3.2.1p2; instead it would have an unspecified yet valid char value, since chars do not have trap representations.

However, the problem is that there is no requirement that A yield the same unspecified value each time it is read, as an unspecified value is a

valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance

And the response to C11 Defect Report 451 seems to consider this to have undefined behaviour after all: it says that using an indeterminate value (even of a type that has no trap representations, such as unsigned char) in an arithmetic expression gives a result that is itself unstable, and that using such values in library functions has undefined behaviour.

Thus:

unsigned char A = A ? 0[&A] & !A : A^A;

doesn't invoke undefined behaviour as such, but A is still initialized with an indeterminate value, and the use of such an indeterminate value in a call to the library function putchar(A) should be considered as having undefined behaviour:

Proposed Committee Response

  • The answer to question 1 is "yes", an uninitialized value under the conditions described can appear to change its value.
  • The answer to question 2 is that any operation performed on indeterminate values will have an indeterminate value as a result.
  • The answer to question 3 is that library functions will exhibit undefined behavior when used on indeterminate values.
  • These answers are appropriate for all types that do not have trap representations.
  • This viewpoint reaffirms the C99 DR260 position.
  • The committee agrees that this area would benefit from a new definition of something akin to a "wobbly" value and that this should be considered in any subsequent revision of this standard.
  • The committee also notes that padding bytes within structures are possibly a distinct form of "wobbly" representation.
  • Exactly my point of dilemma: The code intentionally is NOT `A & !A` but something with an address. – Ferenc Deak Feb 20 '17 at 13:56
  • 1
    That's an answer based on plausibility alone. That's not how the language works. No matter how plausible something appears to be, if the specification calls it UB, UB it is. (Not saying that you are wrong, just that your reasoning is not convincing.) – IInspectable Feb 20 '17 at 14:01
  • @IInspectable but it specifically doesn't call it UB – Antti Haapala -- Слава Україні Feb 20 '17 at 14:11
  • 1
    If it does, why not add the relevant reference(s) to this answer? – IInspectable Feb 20 '17 at 14:16
  • DR 451 concerns only indeterminate values. But in this case that doesn't matter, as the resulting value of *this particular expression* is always 0. Indeed we see that a smart enough compiler optimizes all of it away. No UB. – rustyx Feb 20 '17 at 14:52
  • @RustyX what 0? DR 451 concerns this *very* thing, it says that indeterminate value multiplied by 0 is indeterminate. – Antti Haapala -- Слава Україні Feb 20 '17 at 15:01
  • Well OK whatever the committee were smoking, it must have been good. "*any operation performed on indeterminate values will have an indeterminate value as its result*"... – rustyx Feb 20 '17 at 15:08
  • The problem, as I understand it, is rather that reading `A` while it is indeterminate may give different results each time. So `A ?` could evaluate to 1 and then `!A` could evaluate to `!0` and then `0[&A]` could evaluate to random garbage, and the result will be random garbage bitwise AND with 1. – Lundin Feb 20 '17 at 15:28
  • @IInspectable: In the interest of accommodating the use of C for many platforms with different abilities and application fields with different needs, the authors of the C89 Standard made no attempt to imagine much less forbid every unreasonable thing a C compiler might do, since behaviors that would be patently unreasonable on most (platform+field) combinations might possibly be reasonable on others. Instead, they rely upon implementers to exercise judgment and behave in a fashion appropriate to the target platform and application field. Generating optimal code for... – supercat Feb 20 '17 at 18:14
  • ...heavy-duty number crunching of trustworthy inputs may require a compiler to make assumptions that would be inappropriate when compiling operating-system code, and a compiler which claims to be suitable for e.g. systems programming should recognize that. The fact that the Standard allows a compiler to do something should not be taken as implying that the authors would regard such action as ever being appropriate in a high-quality compiler for some particular target and field--merely that there might be a target and field where the action might sometimes be appropriate. – supercat Feb 20 '17 at 18:27
  • @RustyX: Some compiler writers are highly invested in the ability to treat "Indeterminate" as a value, rather than merely allow for non-determinism. If `a` holds a non-deterministic value, then after `b=a & 9;`, and before the next operation affecting `a` or `b`, a compiler may substitute `a & 9` for `b`, which could yield 0, 1, 8, or 9. If code performs `c=b+b;`, a compiler may regard `c` as holding `(a & 9)+(a & 9)` and may, at any time, replace any occurrences of `a` with any value that `a` might hold. This would allow `c` to behave non-deterministically as any member of the set... – supercat Feb 20 '17 at 21:48
  • {0,1,2,8,9,10,16,17,18}, but only as those values. Note that a compiler wouldn't be particularly expected to keep track of which values `c` could hold; instead, the set of values that `c` might end up holding would be a consequence of which computations the compiler decided to perform when. – supercat Feb 20 '17 at 21:51
  • PS--Given `typedef struct { char st[16];} string15;`, I wonder whether the committee thinks compilers should regard `{ string15 a; strcpy(a.st, "Hello"); string15 b=a;}` as having defined behavior, or whether they think programmers should be required to write code equivalent to `{ string15 a; strncpy(a.st, "Hello", 16); string15 b=a;}` instead [even if only the first six characters in each array would be relevant]. – supercat Feb 21 '17 at 20:36
  • Re “Since using a trap representation has undefined behaviour…”: This is not a universal rule, and it does not apply here. C 2018 (same in 2011) 6.2.6.1 5 says “Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression **that does not have character type**, the behavior is undefined…” (emphasis added). Since the `char A` in the question has character type, accessing it never has undefined behavior due to it being a trap representation. – Eric Postpischil Jun 14 '21 at 10:09
1

This is a category of behavior where the Standard would strongly imply a behavior, and nothing in the Standard would invite an implementation to jump the rails, but the official "interpretation" would nonetheless allow compilers to behave in arbitrary fashion. As such, it would not be accurate to describe the behavior as "undefined" [because the text of the Standard does imply a behavior and says nothing to suggest that it shouldn't apply] nor would it be accurate to simply say it's "defined" [because the Committee says compilers may behave in arbitrary fashion]. Instead it's necessary to recognize an intermediate condition.

Because different application fields (number crunching, systems programming, etc.) benefit from different kinds of behavioral guarantees, and because some platforms may be able to uphold certain guarantees more cheaply than others, the authors of every C standard to date have generally sought to avoid passing judgment on the relative costs and benefits of various guarantees. Instead, they have shown significant deference to implementers' judgment with regard to what guarantees should be provided in what implementations.

If it is plausible that offering some particular behavioral guarantee would have no value in some application field (even if it may be vital in others), and waiving that guarantee might sometimes allow some implementations to be more efficient (even if in most cases it wouldn't), the authors of the Standard will generally not require that guarantee. Instead, they let implementers decide, based upon an implementation's target platform(s) and intended application field(s) whether the implementation should always support that guarantee, never support that guarantee, or allow programmers to select (via command-line options or other means) whether to support the guarantee.

A quality implementation intended for any particular purpose (e.g. systems programming) will support the kinds of behavioral guarantees that would make a compiler suitable for that purpose (e.g. reading an unsigned char that a program owns will never have any effect beyond yielding a possibly-meaningless value), whether the Standard requires it to do so or not. The authors of the C Standard neither require nor intend that all implementations be suitable for fields like systems programming, and thus don't require that implementations aimed at other fields like number crunching uphold such a guarantee. The fact that compilers targeting other fields may not uphold the kinds of guarantees required for systems programming means that it's important that systems programmers ensure that they use tools which are suitable for their purposes. Knowing that a tool promises to support the guarantees one needs is far more important than knowing that present interpretations of the Standard support such a guarantee, given that a guarantee which is considered unambiguous today might disappear if a compiler writer can suggest that waiving it might sometimes be beneficial.

supercat
  • Regarding the quality of systems programming tools, well, I couldn't make my GCC 6.2.0 to emit any warning whatsoever :( – Antti Haapala -- Слава Україні Feb 21 '17 at 04:40
  • That doesn't appear to address the question. It's essentially saying: *"If you need to write code like that, pick a compiler you know produces the desired result. The language specification is irrelevant."* The question, however, is specifically asking about the language specification. That's what the [tag:language-lawyer] tag is for. – IInspectable Feb 21 '17 at 08:22
  • @IInspectable: I see this as something of an XY question [asking X, but needs to know Y]. There are two main reasons someone would typically want to know if X is defined by the Standard: (1) is it reasonable for code targeting some implementation to assume X is defined; (2) is it reasonable for an implementation to treat X as undefined. In either case, the answer will depend upon aspects of the implementation like its intended application field. – supercat Feb 21 '17 at 15:41
  • @IInspectable: I was a bit slow getting to the point, however, so I added another paragraph at the start. Do you think that helps? – supercat Feb 21 '17 at 15:45
  • BTW, the standard explicitly states that *If a ''shall'' or ''shall not'' requirement that appears outside of a constraint or runtime- constraint is violated, the behavior is undefined. Undefined behavior is otherwise indicated in this International Standard by the words ''undefined behavior'' or **by the omission of any explicit definition of behavior**. There is no difference in emphasis among these three; they all describe ''behavior that is undefined''.* – Antti Haapala -- Слава Україні Mar 03 '17 at 14:35
  • @AnttiHaapala: Except in the case of automatic variable whose address is never taken, an Indeterminate Value is defined as either being a value of the appropriate type or a trap representation. If an implementation uses unsigned `char`, or uses signed `char` but does not specify a trap representation therefor (IMHO the Standard should specify that if `signed char` has a trap representation, `char` must be unsigned--I doubt any non-contrived implementations with signed `char` have ever had trap representations for it) the only remaining possibility would be that... – supercat Mar 03 '17 at 15:32
  • ...reading an Indeterminate Value of type `char` yields a value of type `char`. An implementation might not promise anything about whether future reads of the storage holding the Indeterminate Value would yield the same result, but I see no room in the Standard for a read of an Indeterminate Value of type `char` to do anything other than yield a value of the type, except on (almost certainly contrived) implementations where `char` is a signed type with a trap representation. – supercat Mar 03 '17 at 15:34
1

The right-hand side first evaluates A.

In C++, since A is uninitialized at this point, the code causes undefined behaviour.

In C11, since A is uninitialized at this point, its value could be a trap representation, therefore the code causes undefined behaviour.

In C11, if we were on a system that is known to have no trap representations (or we change char to unsigned char), then A has indeterminate value, and then putchar(A) causes undefined behaviour by passing an indeterminate value to a library function.

Further reading for C11 uninitialized variable use.

M.M
  • Re “In C11, since `A` is uninitialized at this point, its value could be a trap representation, therefore the code causes undefined behaviour”: Reading a `char` type cannot cause undefined behavior due to it being a trap representation. C 2011 6.2.6.1 5 says “Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression **that does not have character type**, the behavior is undefined…” (emphasis added). – Eric Postpischil Jun 14 '21 at 10:13