
EDIT: The original word choice was confusing. The term "symbolic" is much better than the original ("mystical").

In the discussion about my previous C++ question, I have been told that pointers are somewhat "symbolic".

This does not sound right! If there is nothing symbolic about a pointer and a pointer is just its representation, then I can do the following. Can I?

#include <stdio.h>
#include <string.h>

int main() {
    int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
    if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
        printf ("pa1 == pb\n");
        *pa1 = 2;
    }
    else {
        printf ("pa1 != pb\n");
        pa1 = &a[0]; // ensure well defined behaviour in printf
    }
    printf ("b = %d *pa1 = %d\n", b, *pa1);
    return 0;
}

This is a C and C++ question.

Testing with Compile and Execute C Online with GNU GCC v4.8.3: gcc -O2 -Wall gives

pa1 == pb
b = 1 *pa1 = 2

Testing with Compile and Execute C++ Online with GNU GCC v4.8.3: g++ -O2 -Wall gives

pa1 == pb
b = 1 *pa1 = 2

So the attempted modification of b through pa1 fails with GCC in both C and C++.

Of course, I would like an answer based on standard quotes.

EDIT: To respond to the criticism about UB on &a + 1, a is now an array of 1 element.

Related: Dereferencing an out of bound pointer that contains the address of an object (array of array)

Additional note: the term "mystical" was first used, I think, by Tony Delroy here. I was wrong to borrow it.

curiousguy
  • Your sample code has UB. – πάντα ῥεῖ Aug 17 '15 at 08:35
  • The compiler is free to arrange variables, so your code may work as you expect or it may not. It's undefined behaviour. – Jabberwocky Aug 17 '15 at 08:36
  • [expr.add]/5 "[for pointer addition, ] if both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined." – TartanLlama Aug 17 '15 at 08:42
  • @TartanLlama In case it makes a difference, I have changed `a` to an array. – curiousguy Aug 17 '15 at 08:45
  • Dereferencing `&a + 1` is undefined, and the compiler is free to assume that doing it does not modify `b` and instead inline `b`'s value. – molbdnilo Aug 17 '15 at 08:46
  • @curiousguy: Why? Because the standard doesn't require the compiler to arrange variables in a specific way. – Jabberwocky Aug 17 '15 at 08:46
  • @curiousguy it doesn't make a difference, `b` is not an element of the array, so the behaviour is undefined. – TartanLlama Aug 17 '15 at 08:46
  • @molbdnilo So two pointers with equal values can have different semantic values? – curiousguy Aug 17 '15 at 08:49
  • @curiousguy Yes, an invalid pointer has different semantics than a valid one. In particular, dereferencing an invalid pointer makes your entire program undefined. – molbdnilo Aug 17 '15 at 08:52
  • @molbdnilo What is an "invalid pointer"? – curiousguy Aug 17 '15 at 09:06
  • @curiousguy: a pointer is invalid when it does not point at an object, a member of an array, or one past the end of an array. – Zan Lynx Aug 17 '15 at 09:09
  • @ZanLynx With the change of `a` to an array of 1 int, the pointer is valid. – curiousguy Aug 17 '15 at 09:09
  • @curiousguy: You're allowed to have a pointer one past the end. But you aren't allowed to dereference it. There's nothing there. Also, the compiler is allowed to look at your pointer use and reduce everything it sees. So you declare b and you declare pointers. But the compiler is free to delete all of that and in fact reduce your entire program to one print statement if it feels like it. – Zan Lynx Aug 17 '15 at 09:12
  • @curiousguy the *value* of a pointer to the hypothetical element after an array is well-defined, but dereferencing it is undefined behaviour. – TartanLlama Aug 17 '15 at 09:12
  • @ZanLynx So a pointer is more than its bit pattern. – curiousguy Aug 17 '15 at 09:15
  • @curiousguy: On x86 and x64 it is a bit pattern. The compiler assumes that all code follows the rules and it may not notice that you changed the bit pattern. Or it might move things into registers and remove the pointers entirely, causing your "clever thing" to disappear. If you don't follow the rules, the compiler optimizations *will* destroy you. – Zan Lynx Aug 17 '15 at 09:18
  • @curiousguy Yes, it "is" more than a bit pattern, even though the bit pattern is the entire representation. And so are `int`s, `float`s, and everything else. Using the value of an uninitialised `int` object is also undefined, regardless of the bit pattern it stores. – molbdnilo Aug 17 '15 at 09:19
  • @ZanLynx "_it may not notice that you changed the bit pattern"_ I did not – curiousguy Aug 17 '15 at 09:22
  • @Jabberwocky "_Because the standard doesn't require the compiler to arrange variables in a specific way._" Of course the compiler could randomize the addresses of complete objects. But then, during every program run, the addresses once set are well defined and can be used for mathematical computations, as an address is just a number. When the compiler has "arranged" the objects in memory, it is committed to this "arrangement" at least during this program execution, and I can play. – curiousguy Jun 07 '18 at 03:36
  • @molbdnilo Would you agree that two pointers with the same value are either both valid or both invalid? – curiousguy Jun 07 '18 at 03:38
  • @ZanLynx "_Also, the compiler is allowed to look at your pointer use and reduce everything it sees_" This is a language-lawyer question. Please provide a quote. – curiousguy Jun 15 '18 at 01:32
  • @curiousguy It is the as-if rule, see http://en.cppreference.com/w/cpp/language/as_if and https://stackoverflow.com/a/15718279/13422 (the answer there has a reference to parts of the C++11 standard). – Zan Lynx Jun 15 '18 at 02:38
  • "_The "as-if" rule basically defines what transformations an implementation is allowed to perform on a legal C++ program_" Yes and nobody has been able to point to **a rule explicitly allowing that transformation**. – curiousguy Jul 02 '18 at 07:16

4 Answers


The first thing to say is that a sample of one test on one compiler generating code on one architecture is not the basis on which to draw a conclusion on the behaviour of the language.

C++ (and C) are general-purpose languages created with the intention of being portable, i.e. a well-formed program written in C++ on one system should run on any other (barring calls to system-specific services).

Once upon a time, for various reasons including backward-compatibility and cost, memory maps were not contiguous on all processors.

For example, I used to write code on a 6809 system where half the memory was paged in via a PIA addressed in the non-paged part of the memory map. My C compiler was able to cope with this because pointers were, for that compiler, a 'mystical' type which knew how to write to the PIA.

The 80386 family has an addressing mode where addresses are organised in groups of 16 bytes. Look up FAR pointers and you'll see different pointer arithmetic.
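
As an illustration of that kind of segmented addressing, here is a small sketch (mine, not part of the original answer): in the classic 16-bit scheme the linear address is segment * 16 + offset, so two pointers with different bit patterns can name the same byte.

#include <stdio.h>

int main() {
    /* Two different segment:offset pairs... */
    unsigned seg1 = 0x1234, off1 = 0x0005;
    unsigned seg2 = 0x1230, off2 = 0x0045;
    /* ...that resolve to the same linear address, 0x12345. */
    printf("%#x\n", seg1 * 16 + off1);
    printf("%#x\n", seg2 * 16 + off2);
    return 0;
}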

This is the history of pointer development in C++. Not all chip manufacturers have been "well behaved" and the language accommodates them all (usually) without needing to rewrite source code.

Richard Hodges
  • The generated code is simply an illustration of the fact that GCC doesn't support this crazy idea. It isn't used as "proof" of anything, and it doesn't work with the modified code (the one with the array). – curiousguy Aug 18 '15 at 20:09
  • C was designed so that the *language* could be ported to many machines, and that a programmer who was familiar with C and familiar with the general characteristics of a particular architecture would know to write C code for that architecture. The design of the language is hostile to the writing of architecture-agnostic code. On the other hand, the reason C became popular is that it didn't try to be "one language", but instead a family of dialects that could exploit the various strengths of different architectures. – supercat Jul 12 '18 at 06:54
  • @supercat when you write "The design of the language is hostile to the writing of architecture-agnostic code." I have to say that this conflicts with my life experience. As written above, I have written C on systems based on Z80, 6502, 6809, 68000, 80x86 and TMS9900, both with and without paged memory and with all kinds of I/O mappings. The C language (and a couple of portability macros) allowed the same source code to compile into functional programs (and mini-OS) for all these systems. The only points of customisation were a few macro definitions, device drivers and linker maps. – Richard Hodges Jul 12 '18 at 08:08
  • @RichardHodges: There's a difference between writing code that will work on a *particular* set of architectures, and writing code that is truly architecture agnostic. The preprocessor can help a lot with portability issues, but a language designed to facilitate architecture-agnostic code would specify that math will behave in two's-complement fashion, even if that means using unsigned math on the underlying architecture and then adding code to handle the situations where it behaves differently from signed. It would also specify architecture-independent promotion rules for.. – supercat Jul 12 '18 at 14:49
  • ..."fixed-sized" types. Writing a Java implementation for something like a 36-bit machine would be "interesting", but if the platform supports compare-and-swap, or if the implementation runs on a single core and gets to control scheduling if its threads, I think it would be possible to achieve halfway-decent performance. By contrast, most C programs written for common microprocessors would be completely useless on a 36-bit machine. – supercat Jul 12 '18 at 14:55
  • @supercat I would agree that not all C programs are well written. It is worth noting that C compilers existed for DEC and IBM architectures which had 9-bit chars and 36-bit words. The integral type sizes in C were deliberately vague for precisely this reason. Writers of portable programs don't as a rule seek to depend on integer overflow behaviour. – Richard Hodges Jul 12 '18 at 16:05
  • @RichardHodges: I've written a TCP stack on a platform with 16-bit "char", and using a language that was pretty much like normal C except for the 16-bit char was definitely nicer than writing everything in TMS3205x assembly code would have been, but the language did nothing to help with making my code be architecture-agnostic. A language designed to let people write architecture-agnostic code should include data types with architecture-agnostic semantics, even if they need to be emulated or even make certain programs incompatible with some platforms. For performance, it may *also*... – supercat Jul 12 '18 at 19:49
  • ...have "native" data types, but my job would have been a lot easier if there were a means of declaring a "16 bits stored as two octets little-endian" data type and have a compiler generate code that would split a write of such a value into two "char"-sized writes [using the bottom 8 bits of each "char"]. If such a type existed, a TCP stack for the PC that used such types would have been easily portable to the TMS part. Parts of it may have performed unacceptably slowly using such emulated types, and thus had to be hand-tweaked to use native types, but that would have been nicer... – supercat Jul 12 '18 at 19:54
  • @supercat can't disagree with that. I had to define pseudo-types for such concepts as "index into array" as signed/unsigned types of 16/8 bits implied vastly different performance and space characteristics between Z80, 6809 etc. Still, the end result was 100% portable with only 2 hours of configuration. – Richard Hodges Jul 13 '18 at 07:17
  • @RichardHodges: It may have been 100% portable among quality general-purpose implementations suitable for low-level programming on a certain subset of platforms, but it would not be "portable" in the sense that the Standard uses the term, nor would it necessarily be reliably portable among "modern" compilers for those platforms. – supercat Jul 13 '18 at 15:47

Stealing the quote from TartanLlama:

[expr.add]/5 "[for pointer addition, ] if both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."

So the compiler can assume that your pointer points into the a array, or one past its end. If it points one past the end, you cannot dereference it. But since you do dereference it, the compiler may assume it isn't one past the end, so it can only point inside the array.

So now you have your code (reduced)

b = 1;
*pa1 = 2;

where pa1 points inside the array a and b is a separate variable. When you print them, you get exactly 1 and 2, the values you have assigned them.

An optimizing compiler can figure that out without even storing a 1 or a 2 to memory. It can just print the final result.
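
As a sketch of what that means in practice (mine, not GCC's actual output), the as-if rule lets the compiler emit something equivalent to the following, assuming *pa1 cannot refer to b:

#include <stdio.h>

int main() {
    /* The pointer comparison succeeded in the observed run. */
    printf("pa1 == pb\n");
    /* b is never written through a valid pointer, so it folds to 1;
       *pa1 was assigned 2, so it folds to 2. */
    printf("b = %d *pa1 = %d\n", 1, 2);
    return 0;
}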

Bo Persson
  • "_If it points one past the end, you cannot defererence it_" This one isn't clear; what does "point" mean? – curiousguy Aug 18 '15 at 20:17
  • You know fine well what it means. It holds the address of the hypothetical `a[N]`, i.e. if the array were 1 element larger, it would point at the final element. The real questions: Why on Earth have you made so many questions about this concept? Would it be useful for anything if it weren't UB? – underscore_d Feb 27 '16 at 18:01
  • @underscore_d If a pointer is a trivial type, then two pointers with the same representation must point to the same set of things. So a one past the end pointer with the same representation as the pointer to the object after the array must point to that object. Are you pretending that pointers aren't really trivial types? – curiousguy Jun 13 '18 at 23:03
  • I'm not pretending anything, but you appear to have pretended a 2nd pointer exists when none did in this discussion. The 1st pointer, i.e. the one that actually exists in this discussion, points one past the end and is valid to form, but not to dereference. – underscore_d Jun 14 '18 at 08:22
  • @underscore_d Are you saying that a pointer cannot be both one past the end and pointing to an object? I'm struggling with that. – curiousguy Jul 02 '18 at 07:21
  • @curiousguy - Yes, that's probably what he is saying. If you have a pointer and move it past the end, it no longer points to an object. Now you could also have *some other* pointer that does point to an object and that object just could have the same address as the past-the-end. But they are still *different* pointers and not interchangeable. – Bo Persson Jul 02 '18 at 07:29
  • @BoPersson I have no problem with the idea that two objects can be equal by any allowed programmatic measurement and still be different. (It just means that the means of measurements are limited.) It's more difficult to accept that two value storing objects can be different if they store equal values. We know that pointers are value storing objects in all currently used compilers. There is no hidden flag in a pointer representation that wouldn't be measurable by `==`. (This can be confirmed by `memcmp`.) That's my difficulty. – curiousguy Jul 02 '18 at 07:36
  • Not only that, we also know that a pointer value can be converted to an integer and back to a pointer, so the integer must fully represent the complete value of the pointer. So two pointers with identical representation will be converted to equal integer values. Are you saying that integers can hold the same value but still be different? – curiousguy Jul 02 '18 at 07:39
  • @curiousguy - Those are the rules. :-) The rules were set at a time when segmented memory was still common. And segments could overlap, so `memcmp` wasn't reliable - different bit patterns `segment:offset` could mean *the same* address. And vice versa - with arrays allocated in separate segments, the same pointer bit pattern meant different objects depending on which segment was used as a base. – Bo Persson Jul 02 '18 at 07:49
  • @BoPersson I understand that a given value for a type can have many different representations. That could also be the case with a fraction class where different fraction representations are indistinguishable by any allowed measurement while still not comparing equal via `memcmp`, which is not an "allowed measurement" for such type. But two fractions with identical representation must be equal. This is implied by the fact that the intrinsic value of a fraction object is determined ONLY by the state of its members. – curiousguy Jul 02 '18 at 07:55
  • "_the same pointer bit pattern meant different objects_" How would the compile manage to make an access to the right object, given an ambiguous pointer value? – curiousguy Jul 02 '18 at 07:58
  • @curiousguy - It's part of the `segment:offset` addressing. The segment part had to be loaded into a segment register, and then you could use just the offset as a pointer into an array stored in that segment. To move to a different array the compiler would have to reload the segment register and then use another set of pointers. – Bo Persson Jul 02 '18 at 08:52
  • @curiousguy: Except when using `huge` pointers (which are seldom used, because they are extremely slow and inefficient), all accesses made to a particular object will use the same segment, and a compiler will assume that two pointers with different segments cannot identify the same object or portions thereof. Consequently, individual objects are generally limited to 65520 (i.e. 65536-16) bytes. The answer to when a compiler should change the segment part of a non-huge pointer to an object is simply: never. – supercat Jul 13 '18 at 15:39
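
As a sketch of the round-trip guarantee discussed in this thread (my own example; it assumes the implementation provides the optional uintptr_t type): a valid pointer converted to uintptr_t and back compares equal to the original. That guarantee says nothing about whether an equal-looking pointer obtained some other way may be dereferenced.

#include <stdint.h>
#include <stdio.h>

int main() {
    int b = 1;
    uintptr_t u = (uintptr_t)&b;   /* pointer -> integer */
    int *p = (int *)u;             /* integer -> pointer: guaranteed to compare equal to &b */
    printf("%d\n", p == &b);       /* prints 1 */
    return 0;
}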

If you turn off the optimiser, the code works as expected.

By using pointer arithmetic that is undefined, you are fooling the optimiser. The optimiser has figured out that there is no code writing to b, so it can safely store it in a register. As it turns out, you have acquired the address of b in a non-standard way and you modify the value in a way the optimiser doesn't see.

If you read the C standard, it says that pointers may be mystical. gcc pointers are not mystical. They are stored in ordinary memory and consist of the same type of bytes that make up all other data types. The behaviour you encountered is due to your code not respecting the limitations stated for the optimiser level you have chosen.

Edit:

The revised code is still UB. The standard doesn't allow accessing a[1] even if the pointer value happens to be identical to another pointer value, so the optimiser is still allowed to keep the value of b in a register.
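
For contrast, here is a minimal well-defined variant (my sketch, not from the answer): when the pointer is derived from &b itself, the optimiser must assume the store reaches b, and the output is the same at any optimisation level.

#include <stdio.h>

int main() {
    int b = 1;
    int *pb = &b;   /* valid pointer to b */
    *pb = 2;        /* defined: writes to b */
    printf("b = %d *pb = %d\n", b, *pb);   /* prints "b = 2 *pb = 2" */
    return 0;
}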

Klas Lindbäck
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/87300/discussion-on-answer-by-klas-lindback-are-pointer-variables-just-integers-with-s). – Martijn Pieters Aug 18 '15 at 17:38
  • The optimizers in gcc and clang treat pointers as mystical. They also treat values of type `uintptr_t` as mystical. If `int *p` can be used to access an object and `int *q` has the same bit pattern but cannot be used to identify the object, gcc's optimizer will in some cases go so far as to assume, where `uintptr_t uptr` is known to be equal to `(uintptr_t)q`, that an access to `(int*)uptr` won't affect `*p`, even if the value in `uptr` happens to actually be derived from `(uintptr_t)p`. – supercat Jul 13 '18 at 16:30
  • @supercat "_even if the value (...)_" when would that happen? – curiousguy Apr 14 '19 at 12:56
  • @curiousguy: Given `#include <stdint.h>` and `extern int x,y[]; int test(uintptr_t z) { x = 1; if (z == (uintptr_t)(1+y)) { *(int*)z=2; } return x; }`, gcc will ignore the possibility that `*(int*)z` might identify `x`, even though the behavior of `test((uintptr_t)&x)` should be defined as always either returning 1 with no side-effect, or writing `2` to `x` and then returning 2. – supercat Apr 15 '19 at 17:06

C was conceived as a language in which pointers and integers were very intimately related, with the exact relationship depending upon the target platform. The relationship between pointers and integers made the language very suitable for purposes of low-level or systems programming. For purposes of discussion below, I'll thus call this language "Low-Level C" [LLC].

The C Standards Committee wrote up a description of a different language, where such a relationship is not expressly forbidden, but is not acknowledged in any useful fashion, even when an implementation generates code for a target and application field where such a relationship would be useful. I'll call this language "High Level Only C" [HLOC].

In the days when the Standard was written, most things that called themselves C implementations processed a dialect of LLC. Most useful compilers process a dialect which defines useful semantics in more cases than HLOC, but not as many as LLC. Whether pointers behave more like integers or more like abstract mystical entities depends upon which exact dialect one is using. If one is doing systems programming, it is reasonable to view C as treating pointers and integers as intimately related, because LLC dialects suitable for that purpose do so, and HLOC dialects that don't do so aren't suitable for that purpose. When doing high-end number crunching, however, one would far more often be using dialects of HLOC which do not recognize such a relationship.

The real problem, and source of so much contention, lies in the fact that LLC and HLOC are increasingly divergent, and yet are both referred to by the name C.

supercat