6

Why does (*p=*p) & (*q=*q); in C trigger undefined behavior if p and q are equal?

int f2(int * p, int * q)
{
  (*p=*p) & (*q=*q);
  *p = 1;
  *q = 2;
  return *p + *q;
}

Source (Nice article by the way): http://blog.frama-c.com/index.php?post/2012/07/25/On-the-redundancy-of-C99-s-restrict
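For contrast, a hypothetical variant in which the two assignments are sequenced (f2_sequenced is my name, not from the article) is well-defined even for aliasing arguments:

```c
/* Hypothetical variant for contrast: the comma operator introduces a
   sequence point between the two assignments, so the first statement
   is well-defined even when p and q alias. */
int f2_sequenced(int *p, int *q)
{
    (*p = *p), (*q = *q);   /* sequenced; no UB when p == q */
    *p = 1;
    *q = 2;
    return *p + *q;         /* 4 if p == q, 3 otherwise */
}
```

With this version the aliasing call is defined and returns 4, so the compiler may no longer assume p != q.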

Yakk - Adam Nevraumont
nawfel bgh
  • I see no C++ question here? Removing tag. – Yakk - Adam Nevraumont Jun 28 '15 at 13:33
  • @Yakk: Isn't the code shown valid C++ (also)? – alk Jun 28 '15 at 13:34
  • What kind of undefined behavior is being produced here? – Dan F Jun 28 '15 at 13:35
  • Can you quote where the article author says it is undefined behavior? I found only one mention, and it doesn't seem to refer to this code. – Hi-Angel Jun 28 '15 at 13:36
  • @alk But the rules about sequencing are different in C and C++ afaik. (Afaik, "sequenced before" does not even exist in C, they still have sequence points). So this should be two different question for two different languages if he actually cares about both. – Baum mit Augen Jun 28 '15 at 13:36
  • @BaummitAugen: Or ask what is the difference in this case regarding UB between C and C++? ;-) – alk Jun 28 '15 at 13:39
  • @alk That might be a valid new question too, but: *"Why does (*p=*p) & (*q=*q); **in C** trigger undefined behavior"* – Baum mit Augen Jun 28 '15 at 13:40
  • @BaummitAugen: Hu, missed this detail ("*.. in C ...*") in the question, sry. – alk Jun 28 '15 at 13:41
  • @alk the duplicate is sufficiently different that this question warrants its own answer, imho. Yes it's sequence points but not every sequence point question is the same – M.M Jun 28 '15 at 13:44
  • `^` as well as `&` are binary operators. However feel free to reopen this. @MattMcNabb – alk Jun 28 '15 at 13:45
  • All the discussion is necessary only if the compiler does not optimize out the whole statement under discussion. :-) – Sourav Ghosh Jun 28 '15 at 13:48
  • It is UB because you cannot modify a memory location twice without an intervening sequence point. – n. m. could be an AI Jun 28 '15 at 13:49
  • @SouravGhosh wrong. We are discussing to which extent the program in question is valid, which entails what exactly different compilers are allowed or not allowed to do with it. There is no "the" compiler pertinent to this discussion. – n. m. could be an AI Jun 28 '15 at 13:59
  • Isn't it like reading and writing to same memory address at same time? `*p=*p`. And will assignment come first before logical `&` operation? – Rakholiya Jenish Jun 28 '15 at 14:02
  • @n.m. I did not get you sir. The result of `&` is not stored anywhere, neither used. With an aggressive optimization, compiler is free to optimize out the whole statement. Correct me if i'm wrong. – Sourav Ghosh Jun 28 '15 at 14:03
  • @SouravGhosh Of course any compiler is free to deduce that the first statement of the function has no effect and optimise it away completely, in all cases. Which, incidentally, some compilers actually do. We are trying to discuss, in effect, whether there are possible optimisations beyond that. For instance, may compilers assume that `p != q` or not? If they can, there are other optimisation opportunities. Contrary to what you suggest, these may be pursued even if the first line is actually optimised away. – n. m. could be an AI Jun 28 '15 at 14:21
  • @Hi-Angel The sentence “The first statement in the body informs the compiler that `f2()` will never be called with aliasing arguments `p` and `q`” is intended to mean that it is undefined behavior to call `f2` with aliasing arguments `p` and `q` because of the first statement `(*p=*p) & (*q=*q);`. – Pascal Cuoq Jun 30 '15 at 10:13

5 Answers

4

If *p and *q designate the same memory location, then writing to them both without an intervening sequence point (or sequence relation in C11) causes undefined behaviour.

= and & do not introduce sequence points.

The code is equivalent to int i = 0; (i=i) & (i=i); which has UB for the same reason. Another similar example would be (*p = 1) & (*q = 2).
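A minimal sketch of the contrast (the function name is mine): unlike &, the && and comma operators do sequence their operands, so the analogous self-assignments are well-defined.

```c
/* Unlike &, the && and comma operators sequence their operands,
   so multiple writes to the same object are permitted here. */
int sequenced_demo(void)
{
    int i = 0;

    /* (i = i) & (i = i);   -- UB: two unsequenced writes to i */

    (i = 1) && (i = 2);   /* defined: && evaluates its left operand first */
    (i = 3), (i = 4);     /* defined: the comma operator is a sequence point */

    return i;             /* well-defined result: 4 */
}
```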

M.M
4

The ruling of the C11 Standard on the statement

(*p=*p) & (*q=*q);

is:

P1

§6.5p3

The grouping of operators and operands is indicated by the syntax. 85) Except as specified later, side effects and value computations of subexpressions are unsequenced.

Since §6.5.10 Bitwise AND operator fails to mention sequencing of its operands, it follows that (*p=*p) and (*q=*q) are unsequenced.

P2

§6.5p2

If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings. 84)

Both assignments (*p=*p) and (*q=*q) are unsequenced w.r.t. each other by §6.5p3, and have a side-effect on the same object if p==q. Therefore, if p==q, then by §6.5p2 we have UB.

P3

§3.4.3

undefined behaviour

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

By this clause we know that the standard imposes no requirements on UB. This is commonly interpreted by compilers as a license to ignore the possibility that such behaviour occurs.

In particular, it allows the compiler to not handle the case p == q, which means that it may assume that p != q.

P1+P2+P3 -> C1

By the combined premises P1, P2 and P3, the compiler may assume that (*p=*p) and (*q=*q) do not invoke UB, and may therefore treat them as loads from and stores to distinct memory locations. This in turn means the return value of f2 may be assumed to be 3 rather than 4. If p == q, the Standard imposes no requirements on what occurs.
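Conclusion C1 licenses a rewrite along these lines. This is a sketch of the code an optimizing compiler may treat f2 as, a hypothetical transformation, not the output of any particular compiler:

```c
/* What f2 may be compiled as, once the compiler assumes p != q:
   the first statement is dead and is removed entirely, and the
   store to *q is assumed not to touch *p, so the return value
   folds to the constant 3. */
int f2_as_if(int *p, int *q)
{
    *p = 1;
    *q = 2;
    return 3;   /* 1 + 2, with *p assumed unclobbered by *q = 2 */
}
```

For p != q this is observably identical to f2; for p == q the Standard imposes no requirements, so the discrepancy (3 instead of 4) is permitted.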

Iwillnotexist Idonotexist
  • Could you help me in explaining the sentence in the referenced article about the first statement _"The first statement in the body informs the compiler that f2() will never be called with aliasing arguments p and q. "_? – Paul Ogilvie Jun 28 '15 at 22:06
  • @PaulOgilvie Yes; My conclusion C1 points the way. Because the compiler is entitled to assume UB does not occur, it does not have to generate code for the case `p == q`, because if that case were to occur, UB would happen _unconditionally_ (as the statement is the first thing in the function). Here's the compiler's reasoning as a code transformation: http://pastebin.com/JEbdkk0Z – Iwillnotexist Idonotexist Jun 28 '15 at 23:05
2

In simple terms, (*p = *p) & (*q = *q) is undefined if p and q have the same value because:

  • You can't mutate the same location twice in an unsequenced evaluation; and
  • You can't read from a location which is being mutated in the same unsequenced evaluation.

This is undefined behaviour in both C and C++, although the standard wordings are slightly different (and the above text doesn't correspond to either standard precisely; it is intended as a simplified explanation. I'm sure you can find precise texts on SO.)

The & operator is a simple bitwise and, so it does not impose any evaluation order. It may seem like *p = *p is an obvious no-op, but there is no guarantee that it is implemented that way. A compiler may (for example) implement that as tmp = *p; *p = 0; *p += tmp. It may also not be able to set all the bits of *p at once, requiring that the assignment be done piecemeal.
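To see how a non-atomic lowering can misbehave, here is a C model of the "clear then restore" implementation described above, with the two unsequenced assignments interleaved in one of the orders a compiler could choose (hypothetical; real compilers need not interleave this way):

```c
/* Models (*p=*p) & (*q=*q) with each assignment lowered to
   "tmp = *x; *x = 0; *x += tmp" and the two (unsequenced)
   lowerings interleaved. If p == q, the restore steps add the
   saved value twice instead of restoring it. */
int interleaved(int *p, int *q)
{
    int tp = *p;   /* load for *p = *p */
    int tq = *q;   /* load for *q = *q */
    *p = 0;
    *q = 0;        /* if p == q, both clear the same object */
    *p += tp;
    *q += tq;      /* if p == q, the value is doubled, not restored */
    return *p;
}
```

With distinct objects this is a genuine no-op; with p == q the "no-op" doubles the stored value, one concrete way the UB can bite.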


Now, a little personal bugbear. Saying that <something> "triggers undefined behaviour" makes it sound like there is some category of behaviour called "undefined behaviour", perhaps a kind of big red button which will start firing nasal demons in all directions when pressed. That's not a good model for what is happening. It is better to say "the behaviour of <something> is undefined".

Be aware that the behaviour of an entire program is undefined if any part of the program which is executed has undefined behaviour. The entire program, not the part of the program starting with the part with undefined behaviour.


Finally -- and this is the point of the linked article -- the compiler is allowed to assume that the behaviour of the program is defined. Consequently, if the program includes an expression like (*p = *p) & (*q = *q), then the compiler can assume that p and q point to different non-overlapping objects. And once it makes that assumption, it may be able to better optimize expressions involving both *p and *q. It is also likely that once the compiler has made that assumption, then it can eliminate the entire computation of (*p = *p) & (*q = *q), because intermediate values of *p and *q (if there are any) are not observable if p and q are distinct. So you can think of that expression as a kind of declaration: you are promising the compiler that you have done whatever is necessary to guarantee that p and q point to different non-overlapping objects. (The compiler will not, and probably cannot, verify your claim. It will just take your word for it.)

The author then argues that this idiom is more powerful than the (somewhat controversial) restrict keyword. I have no doubt that it is, and it is probably possible to construct expressions like that to cover a number of restrictions which cannot be easily expressed with restrict. So to that extent, it seems an interesting idea. On the other hand, the precise expression is, to say the least, obscure and easy to get wrong.
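For comparison, the restrict spelling of the same promise looks like this (a sketch; f2_r is my name):

```c
/* The same non-aliasing promise spelled with restrict: calling
   f2_r with p == q is undefined behaviour, so the compiler may
   assume the two stores do not interfere and fold the return
   value to 3. */
int f2_r(int * restrict p, int * restrict q)
{
    *p = 1;
    *q = 2;
    return *p + *q;   /* may be compiled as: return 3; */
}
```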

rici
  • The specific phrase "Undefined Behavior" has a technical meaning which is very different from the English meaning of "the Standard does not define the behavior of X". Those not familiar with C-style Undefined Behavior would be more likely to interpret "The Standard does not define the behavior of `-1<<1`" as meaning "The expression `-1<<1` may yield a value other than -2", rather than "No misbehavior stemming from any input which could cause a conforming compiler to try to evaluate `-1<<1` shall be considered a violation of the Standard, even if such misbehavior precedes any possible evaluation". – supercat Jun 29 '15 at 16:48
  • @supercat: It's true that the difference between undefined and unspecified behaviour is not obvious from the terms. (Since behaviour which is not defined is undefined behaviour (4/para 2), it is necessary to spell out unspecified behaviour explicitly.) The important aspect of UB is that the behaviour of the entire program which relies on UB is undefined, even before the construct whose behaviour is undefined is executed. (I said that in the answer, and you said that in your comment. So we are in agreement here.) This is precisely why I dislike "triggers UB", which implies a sequence of events. – rici Jun 29 '15 at 19:06
  • Any time I refer to C-style Undefined Behavior, I always use the phrase capitalized as shown; I interpreted your objection as being to the idea that the invocation of "Undefined Behavior" means something beyond the fact that a program does something which the standard doesn't define, rather than the fact that "triggering" implies causality. I would dispense with the latter objection by noting that "Undefined Behavior revokes laws of time and causality". Incidentally, the revocation of causality is worse than what would be authorized by merely allowing compilers to assume a program... – supercat Jun 29 '15 at 19:12
  • You can dispense with the objection if you wish, but it seems to me that the unsophisticated readers you refer to might not immediately understand that causality is revoked, and capitalization won't help with that. Whatever. The revocation of causality is irrelevant if the program's behaviour is, in fact, defined; no one is talking here about how invalid programs might be executed, but rather about how valid programs might be optimized given some extra hints. – rici Jun 29 '15 at 19:18
  • ...won't engage in Undefined Behavior, since although such assumption would be valid grounds for revoking the laws of time, it would not be sufficient to revoke the laws of causality. For that reason, I almost always explicitly state that Undefined Behavior in C/C++ (and almost no other language) revokes the laws of time *and causality*. With regard "valid programs", look at my answer. I would posit that where hyper-modern UB is an absolute killer is with programs which work correctly with valid inputs, but may invoke UB with some invalid inputs. Are such programs "valid"? – supercat Jun 29 '15 at 19:23
  • @supercat: I'd say "no", but I'm not sure that the standard backs me up. In practice, a program should always check its inputs for validity, which is why I'd say that a program which doesn't do that is wrong. However, the standard refers to "A program that is correct in all other respects, operating on correct data" (4/p2), which is open to the interpretation that a program may *document* what are valid inputs rather than try to detect invalid ones. (In fact, the standard itself does this with respect to the inputs to a compiler.) – rici Jun 29 '15 at 19:30
  • @supercat: If, however, you are asking whether a program whose behaviour is undefined on certain inputs has undefined behaviour for all inputs, I think the answer is clearly no. The abstract machine is not a program analysis tool: it describes a non-deterministic evaluation which includes input. – rici Jun 29 '15 at 19:32
  • My point is that hyper-modern philosophy dictates that optimizing the behavior of programs when given input that would not cause UB is more important than ensuring that programs won't launch nuclear missiles even when given invalid inputs that would cause UB. Consider a graphic decoder built into a browser. If the remote computer supplies an invalid graphics file, it's acceptable to have the browser display arbitrary pixels, but it should not allow the remote computer to cause arbitrary code execution. In cases where a function will only be acting on pre-validated data, it may be useful... – supercat Jun 29 '15 at 19:38
  • ...to have a means of telling the compiler that it may make certain assumptions about the data, with unbounded consequences if those assumptions prove false. In cases where a function must accept "untrusted" data, however, requiring that user code rigidly validate everything will often add a lot of complexity and overhead which an optimizer would not be able to remove, and which may seriously impede parallelization. Having means of avoiding unbounded UB without such rigid checking would allow programs to be much cleaner more efficient. – supercat Jun 29 '15 at 19:45
  • @supercat: I understand your opinion, but I fail to see how a comment thread in SO is an appropriate venue to air it. As you would see if you read my answer to the end, I'm not enthralled with Pascal Cuoq's suggestion either. You might want to take it up with him instead (http://stackoverflow.com/users/139746/pascal-cuoq). Addendum: Checking user input is messy, but if you don't want to end up with injection attacks, buffer overruns, etc., you need to do it. Avoiding UB is the easiest part; most injection attacks are not UB. – rici Jun 29 '15 at 19:51
  • Since I don't think any constructs in C include "arbitrary code injection" among their defined behaviors, I don't see how it could occur without Undefined Behavior. Returning back to your original post, my point was that I did not interpret your post as adequately conveying the idea that hyper-modern philosophy dictates that all forms of UB should be considered malignant whether or not it would cost anything to make the consequences benign on a given platform. – supercat Jun 29 '15 at 20:11
  • @supercat: Passing an untrusted string into a call to execve is defined behaviour, but it's not a good idea. Ditto, many database interfaces. For example. I apologize for my post not being sufficiently vehement in support of your strongly-held opinion. – rici Jun 29 '15 at 20:20
2

When the C Standard was written, the effect of certain actions varied between platforms, so it would not always be possible to guarantee any particular precise effect; and where there might plausibly exist implementations on which an action could trigger a hardware trap whose behavior was outside the control of the C compiler, there was little perceived value in having the Standard say anything about that action's behavior. Even in the absence of any significant likelihood of a hardware trap, the possibility of "surprising" behavior was sufficient to brand an action as Undefined.

Consider for example, unsigned long x,*p; ... *p=(x++);. If p==&x, it would be possible for *p to end up holding not only the old value of x or a value one greater, but, if x was e.g. 0x0000FFFF, plausibly also 0x00000000 or 0x0001FFFF. Even if no machine would trigger a hardware trap, I don't think the Standard's authors would have considered "Any lvalue modified more than once will hold Indeterminate Value, and any read of an lvalue in the same expression that writes it in a manner other than allowed herein may yield Indeterminate Value" to be in any way more useful than simply declaring such actions to be Undefined Behavior. Further, from the point of view of the Standard's authors, the failure of the Standard to mandate particular behaviors in cases where some platforms could provide them free of charge and others could not was not expected to pose an obstacle to the specification of such behaviors on platforms that could provide them.

In practice, even very loosely-specified behaviors can often be very useful for programs which share the following two requirements with the vast majority of programs written today:

  1. When given valid input, produce correct output.
  2. When given invalid input, do not launch nuclear missiles.

Unfortunately, someone came up with the idea that if the C Standard does not mandate the behavior of some action X in a particular situation Y, even if most compilers happen to have behavior which would be adequate for programs seeking to meet the above requirements (e.g. most compilers will generate for the expression p < q code that will either yield 0 or 1 and have no other side-effects, even when p and q identify unrelated objects), then action X should be regarded as an indication to the compiler that the program will never receive any input which would cause situation Y.

The indicated (*p=*p) & (*q=*q) is intended to represent such a "promise". The logic is that since the Standard wouldn't say anything about what a compiler can do if p==q, a compiler should assume that the programmer wouldn't mind if the program would launch nuclear missiles in response to any input that could cause the code to be executed when p==q.

That idea and its consequences are fundamentally antithetical to the very nature and design objectives of C and its use as a systems-programming language. Nearly all systems offer some features and guarantees beyond those mandated by the Standard, though the specifics vary from one system to the next. I consider preposterous the idea that the language is better served by redefining x < y from "I'm willing to accept whatever means of pointer comparison is used by any hardware on which this program is actually going to be run" to "I'm so certain that these two pointers will be related that I would stake my life on it", than it would be by adding a new means of directing the compiler to assume that "x and y are related pointers", but somehow it seems to be becoming accepted.

supercat
1

The question of this thread starts with "Why does (*p=*p) & (*q=*q); in C trigger undefined behavior if p and q are equal?" and the questioner refers to an article that reasons that the restrict keyword in C is unnecessary because we can tell the compiler the same thing by writing the expression (*p=*p) & (*q=*q);.

The explanation of this expression by user Iwillnotexist Idonotexist is very thorough... and very complex. Basically, the conclusion is that this is more a directive than a statement, since the expression yields no result that is used and only has side effects (assignment to self) that have no effect (self remains unchanged, even if p==q), so any good compiler may optimize it out.

Still not completely grasping the explanation, I opt for the new keyword rather than writing such an error-prone expression.

Paul Ogilvie
  • I'll attempt to distill my explanation above to the essential. `(*p=*p) & (*q=*q);` is intentionally put in the code because _even though it is optimized out_ (it has no effects), if the compiler _did leave it in_, it would execute invalid C code _specifically for the case `p == q`_. The compiler is _not obliged_ to make the code work for a broader range of cases than the C code specifies; So if the code _intentionally executes invalid C code_ for a given case (here, if `p == q` holds), then the compiler is _allowed to assume_ that the given case does not occur, and optimize accordingly. – Iwillnotexist Idonotexist Jun 29 '15 at 13:49
  • The notion that it is better to use Undefined Behavior rather than compiler directives to control a compiler's assumptions is, by far, the worst thing ever to be "added" to the C language. The vast majority of programs have two requirements: (1) When given valid input, give correct output; (2) When given invalid input, do not launch nuclear missiles. In many cases, even very loose guarantees about program behavior in cases not specified by the Standard would allow programs which need to meet both requirements to do so much more cheaply than would be possible in strictly-conforming C code. – supercat Jun 29 '15 at 16:53
  • Given `int32_t x,y,z;` if code needs to assign to `x` the value `y+z`, in cases where the sum is in range (which it will be given valid input), and may either raise SIGFPE or yield a result which behaves like any number (not necessarily constrained to INT32_MIN..INT32_MAX) in case of overflow, but must refrain from launching nuclear missiles in any case, writing the computation as `x=((uint32_t)y+z);` might work, but would needlessly constrain behavior on some 64-bit platforms and is far less clear than would be `x=y+z;`--the notation that would historically have been the best. – supercat Jun 29 '15 at 17:02
  • @Iwillnotexist Idonotexist, "_...if the compiler did leave it in, it would execute invalid C code specifically for the case `p == q`._ " No, in no case will this result in invalid C code, i.e. in unintended/unexpected results, including side-effects. If `p == q` the expression degrades to `(*p=*p) & (*p=*p);`. If the compiler would compile it and it would be executed, the result is still nil. So tweaking the standard to come up with rare things that compiler builders must implement is not the way forward. C is a "chain saw" language. Let's keep it that way. It always did exactly what I wanted. – Paul Ogilvie Jun 29 '15 at 19:13
  • @PaulOgilvie That's where the Standard disagrees. The degenerate expression `(*p=*p) & (*p=*p);` causes UB by my argument above (_>1 assignment to one variable between two sequence points_). In theory, a compiler that sees this expression thus has the _right_, though not the obligation, to assume that any code path leading to or following it won't ever occur. I don't necessarily agree with this decision of the Standard, but that is what it says. I do agree with you that most compilers would _not_ launch nukes if `p == q`, and would instead optimize the expression out. – Iwillnotexist Idonotexist Jun 29 '15 at 19:39