4

I'm writing a compiler that matches, within {} scope, by and large C99's semantics. When trying to reverse-engineer how gcc handles certain 'undefined behaviour', concretely, chained pre- and post-increments of variables, I noticed that it gets hopelessly confused if you combine this with modifying assignments (e.g., "*=") and array access. Simplifying to the easiest point of apparent utter confusion, gcc 4.6.3. evaluates (with and without option -std=c99):

a[0] = 2;
a[0] *= a[0]++;

to

a[0] = 3.

Am I mis-remembering the standard incorrectly? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

Also, even if the behaviour is 'undefined', the above seems like a particularly poor way of calculating the result as I could only see how you'd justify a result of 5 ( = 2*2 + 1, what I would have implemented - post-increment after an assignment statement), or 6 ( = 3 * 2, use a variable, then immediately post-increment it, and process in the order of parsing - the parser is almost certain to evaluate the "*=" after evaluating the RHS expression). Any insight into this - from a C or C++ angle?

I noticed this when trying to combine arrays with integer expression bounds with pre- and post-increments, and realize this is really hard; but still, the above seems a little bit like a cop-out considering the flagship status of gcc.

This is under Ubuntu 12.04.

Edit: I should have added that gcc's behavior can be reverse-engineered if the variable is not an array element - at least all examples I tried work as follows: (1) evaluate all compound expression pre-increments; (2) evaluate the expression; (3) evaluate all compound expression post-increments. So it probably has to do with the 'really hard' above as well.

Note: clang produces the philosophically reasonable value of 6. I ran more elaborate cases with clang, and am reasonably certain that it treats the array access and scalar case the same, and operates as in what I described above as the second philosophically reasonable way.

gnometorule
  • 2,151
  • 2
  • 20
  • 29
  • 1
    There's not much point in trying to get 'correct' undefined behavior, since there's really no such thing. One of the main reasons C and C++ have so much undefined behavior is to give implementers freedom to optimize (or not force them to pessimize) many operations. In other words, undefined behavior is permission for the compiler to 'cop out'. – Michael Burr May 29 '14 at 15:38
  • I don't see why you consider that result mysterious. The mutations in an modifying-assignment and in a post-increment can happen at any time in the evaluation of the expression following the initial access; so the mutations in `*=` and `()++` *are not ordered*. Apparently in this case gcc chooses to do them left-to-right. – rici May 29 '14 at 15:38
  • @rici: And he's asking why. – Lightness Races in Orbit May 29 '14 at 15:39
  • @rici: gcc still keeps processing, and produces a result. I realize that a valid answer would be '100,000' as we're dealing with undefined behavior, but I cannot see how left-tp-right parsing could ever produce '3' (I gave you my 2 interpretations of what I'd consider a philosophically justifiable way of parsing). – gnometorule May 29 '14 at 15:41
  • I put it in an answer. @LightnessRacesinOrbit: I'll try to guess a reason but it's hard to second-guess gcc. – rici May 29 '14 at 15:43
  • 1
    @rici: I don't think he wants guesswork; he wants someone who knows the answer to post the answer. Said "someone" is more-than-likely a GCC contributor. Or, someone with enough knowledge to make a _good_ guess, and from the looks of your answer, that's you. :) – Lightness Races in Orbit May 29 '14 at 15:55
  • @LightnessRacesinOrbit: Fair enough. Until he finds one, I'll leave my answer in place. – rici May 29 '14 at 15:56
  • @LightnessRacesinOrbit: that would definitely be awesome. :) – gnometorule May 29 '14 at 15:58
  • 3
    The compiler is doing `__tmp = a[0] + 1; a[0] *= __tmp; a[0] == __tmp;` which is just a valid ordering of instructions. – David Rodríguez - dribeas May 29 '14 at 16:00

4 Answers4

4

The mutations in assignments (including read-and-mutate assignments such as *=) and in post-increments can happen at any time in the evaluation of the expression following the initial access of the value of the mutated cell. Consequently, the mutations to a[0] in *= and a[0]++ are not ordered with respect to each other. Apparently in this case gcc chooses to do them left-to-right.

Or to put it more precisely, the expression:

x *= y++;

can be rewritten as:

tmp1 = y + 1;        tmp2 = x * y;
x = tmp1;            y = tmp2;

where the columns can be interleaved in any order.

Note that in practice, both of the tmp variables are probably machine registers. In fact, what is probably going on is more like this:

r1  = y;
r2  = r1 + 1;
r3  = x;
r4  = r3 * r1;
x = r4;
y = r2;

So why do the last two assignments in that order rather than the other order? Well, why not?

Obviously, the confusion here is that x and y are the same location, but gcc is not obliged to notice that. It probably uses an optimization heuristic based on the assumption that they are different locations.

But suppose that it did notice that they are the same. In that case, one or other of the assignments can be eliminated. And furthermore, the computation of the temporary which is to be stored can also be eliminate. So the compiler would have the options of eliminating r4 = r3 * r1; x = r4 or r2 = r1 + 1; y = r2. With the choice of eliminating an increment or a multiplication, what would a self-respecting optimizer do?

rici
  • 234,347
  • 28
  • 237
  • 341
  • 1
    I guess it's hard to second-guess gcc, and your point of possibly 2 registers overwriting the same location is good and might be part of the issue. Note though that a[0] = 1; a[0] *= ++a[0] + a[0]++ * ++a[0] + a[0] gives a[0] = 3, and this doesn't seem to quite mesh with the above. Also note that my compiler does this all fine for non-array variables already, and I have a pretty good understanding of this. +1 for the register overwrite though which won't explain what's going on, but I have a hunch that it points in the right direction. – gnometorule May 29 '14 at 15:53
  • @gnometorule: In that example, if the last assignment done is the one in `a[0]++` (which is the rightmost assignment ignoring the preincrements which are done prior to access) and if the accesses are done left-to-right, then the expected result is precisely `3`. The first preincrement mutates the value to 2; the postincrement stores 3 (2+1) as the eventual mutation; blah blah blah and the mutation happens at the end. – rici May 29 '14 at 16:02
  • I'm not sure I agree, but maybe. :) Also, please note my edit in the question. – gnometorule May 29 '14 at 16:05
  • @gnometorule: An even harder one would be `a[0] *= *a++` or even `*a *= (*a++)++` :) And in the last case, what if the code had been `*a *= (*b++)++` but at the beginning `a` and `b` happened to have to same value? – rici May 29 '14 at 16:12
  • My language is pointer free - which makes the claim it is by and large C heroic. :) Arrays though are treated as-if. – gnometorule May 29 '14 at 16:24
  • 1
    @gnometorule: That simplifies things for you, but not necessarily for `gcc` :) Anyway, good luck with the project. And, honestly, I wouldn't worry about "correct" undefined behaviour. I don't think gcc does. You need to worry about correct defined behaviour, and if the consequence is that UB is (even) weirder, that's the way it goes. You'll probably find that gcc produces a variety of different results depending on optimization settings. – rici May 29 '14 at 16:31
3

Am I mis-remembering the standard incorrectly? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

From the horse's mouth (2011 draft):

6.5 Expressions
...
2 If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.84)

84) This paragraph renders undefined statement expressions such as
  i = ++i + 1;
  a[i++] = i;
while allowing
  i = i + 1;
  a[i] = i;

Undefined means undefined; the compiler is not obligated to produce a meaningful or logical result (it can if it wants to, but it doesn't have to); it's not even required to reproduce the same result for each run. gcc does something, but it's likely that something is pretty random given all the freedom for ordering operations.

Note that a[0] *= a[1]++ would be well-defined (provided neither a[0] nor a[1] contain trap representations, anyway).

Edit

As to how you could get 3 out of all that, I have an idea...

  1. r1 <- a[0] (evaluate LHS)
  2. r2 <- a[0] (evaluate RHS)
  3. r1 <- r1 * r2 (perform multiplication)
  4. a[0] <- r1 (write result of multiplication to LHS)
  5. a[0] <- r2 + 1 (apply side effect to RHS)

Presto; you've clobbered the result of the multiplication because of how the side effect is applied. Whether gcc actually does something like that is an open issue, and either way it's irrelevant; undefined behavior doesn't have to be consistent or repeatable.

Community
  • 1
  • 1
John Bode
  • 119,563
  • 19
  • 122
  • 198
  • +1. Thanks for addressing the first part of my question, and digging up the relevant lines form the standard. – gnometorule May 29 '14 at 16:13
  • 1
    @gnometorule: see my edit for one possible way to get a result of 3. – John Bode May 29 '14 at 17:31
  • Yes, if this is what gcc were to do, I agree. This might actually be similar to what happens - per my edit, gcc does fine if a is not an array element, and in this case there wouldn't be temporaries. So maybe. :) – gnometorule May 29 '14 at 17:40
2

The (sub-)expression a[0]++ has a value and a side effect: the value is 3; the side effect is changing the value of a[0] to 3. Note that the side effect can occur anytime between the enclosing sequence points (the end of the previous statement and the end of the current statement).

The (sub-)expression a[0] *= <value> has a value and a side effect: the value is /* old value of */ a[0] * <value>, the side effect is changing a[0] to the previous value of a[0] multiplied by <value>.

Note that you have two (sub-)expressions attempting to change the same object between sequence points. The C Standard specifically says only 1 such change is allowed.

If you are writing a compiler, you are free to detect such use and report it (with a warning or error), or not detect it and do anything (or even nothing -- ignoring the statement is perfectly fine).
anything includes making the compiler produce an executable the prints "Hello, World!" or abort the compiler itself.

pmg
  • 106,608
  • 13
  • 126
  • 198
  • -1: No, read the question again. This is not a piece of code he's writing. It's a point of curiosity in an attempt to determine specifics of compiler implementations. He's writing a compiler. – Lightness Races in Orbit May 29 '14 at 15:38
  • Your edited version actually might help explain what is going on under the hood of gcc: it performs certain actions within sequence points; stores the results; then checks for violations, and picks one of the resulting values. Note though that a[0] = 1; a[0] += ++a[0] + a[0]++ * ++a[0] + a[0] gives a[0] = 3, which implies in this case also the post-increment is dis-regarded. I find it all pretty strange. – gnometorule May 29 '14 at 15:48
  • Or even abort the program... When a program contains a sentence with undefined behaviour, the program as a whole has undefined behaviour, even before the statement causing that undefined behaviour (or at least in C++). – ABu May 29 '14 at 15:57
  • @Peregring-lk: Well, I don't know about that. – Lightness Races in Orbit May 29 '14 at 16:22
  • @LightnessRacesinOrbit Although that is what "undefined behaviour" means from the point of view of the C++ standard (or better said, what the freedom of the meaning of "undefined behaviour" implies), of course, compilers don't bring this implication so far. – ABu May 29 '14 at 19:38
  • @LightnessRacesinOrbit Definition of undefined behavior (1.3.2.4): "behavior for which this International standard imposes **no requirements**". Search now in any of the last drafts the sentence "undefined behavior" (specially after 3.6.3), and you will see always the sentence "the program has undefined behavior" (actually, I haven't check all matchs). Lastly, check it out: http://c2.com/cgi/wiki?UndefinedBehavior – ABu May 29 '14 at 20:48
  • @Peregring-lk: Mmmm yeah fair enough, most utterances are prefixed with the "the program has". Good enough for me! – Lightness Races in Orbit May 29 '14 at 21:14
-1

It's called "undefined behaviour". You are modifying the same object a[0] twice in the same statement (without intervening sequence points), once by using the *= operator, once by using the ++ operator. The behaviour is undefined. Anything can happen, including your computer crashing or worse.

Once behaviour is undefined, arguing about what the compiler did is completely pointless. It's a bug in your code. Fix it.

David
  • 2,226
  • 32
  • 39
gnasher729
  • 51,477
  • 5
  • 75
  • 98
  • 1
    Well, no, it's not "completely pointless" if you're interested in how compilers are implemented and why they do what they do. Did you read the first sentence of the question? (Or any of it, really...) Repeating your UB mantra is neither helpful nor useful in this instance: _there is no code to fix_. – Lightness Races in Orbit May 29 '14 at 15:32
  • @LightnessRacesinOrbit You are not going to understand what a modern optimizing compiler does by looking at the output of programs that invoke undefined behavior. There are hundred of passes that have only been engineered to work well together when the source code was defined. The very authors of the compilers use their complexity as an excuse for silently compiling to nonsense instead of diagnosing undefined behavior: http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html . It is completely pointless to try to make sense of it, as this answer says. – Pascal Cuoq May 29 '14 at 17:04
  • 1
    @ParcalCuoq: Well Pascal I'm glad that you possess supreme knowledge about what is, and is not, worthwhile wondering about, which apparently does not involve wondering how to process undefined behavior in compilers. You probably never made it to the edit either, just jumped at "undefined behavior! undefined behavior!". – gnometorule May 29 '14 at 17:22
  • @PascalCuoq: We all know that. That's why the question is asking for _other_ ways to understand what a modern optimizing compiler does. That you can't do it by looking at the output of programs that invoke undefined behaviour is _the entire premise_ of this question. – Lightness Races in Orbit May 29 '14 at 17:48
  • @PascalCuoq: Thanks for the link though. This looks like an interesting, and relevant, read. – gnometorule May 29 '14 at 18:02
  • @gnometorule It is not relevant because it is about LLVM. Your question is so bad that it only applies to one compiler, in one specific version, for one target with a specific set of compilation options. And you are right, I stopped reading at “it gets hopelessly confused”, scrolled down to the undefined code, and concluded that it was not the compiler that was hopelessly confused. – Pascal Cuoq May 29 '14 at 20:51
  • @gnometorule You may also enjoy this link: http://blog.regehr.org/archives/759 or the first half of this post of mine: http://blog.frama-c.com/index.php?post/2013/07/31/From-Pascal-strings-to-Python-tuples , although they are about undefined behavior in general, not the particular undefined behavior in your question. – Pascal Cuoq May 29 '14 at 20:54