
According to the C++ draft [expr.add], when you subtract pointers of the same type that do not belong to the same array, the behavior is undefined (emphasis mine):

When two pointer expressions P and Q are subtracted, the type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the <cstddef> header ([support.types]).

  • If P and Q both evaluate to null pointer values, the result is 0.

  • Otherwise, if P and Q point to, respectively, elements x[i] and x[j] of the same array object x, the expression P - Q has the value i−j.

  • Otherwise, the behavior is undefined. [ Note: If the value i−j is not in the range of representable values of type std::ptrdiff_t, the behavior is undefined. — end note ]

What is the rationale for making such behavior undefined instead of, for instance, implementation-defined?

αλεχολυτ

  • What meaning would the resulting value have? – Blaze May 08 '19 at 08:14
  • I don't think there would be much difference if it were implementation-defined; you would probably just have to read in your compiler's documentation that it is undefined ;) – 463035818_is_not_an_ai May 08 '19 at 08:15
  • @Blaze for instance, the compiler could interpret them as pointers into the same array and simply return a value as in the previous clause; at least that would eliminate UB from the program. – αλεχολυτ May 08 '19 at 08:20
  • @user463035818 I'm not sure that implementation-defined behavior can be interpreted as undefined. – αλεχολυτ May 08 '19 at 08:23
  • @αλεχολυτ what I'm getting at is that the resulting value is nonsensical; there's no way to use it properly. Instead of forcing the compiler to generate a nonsense value, the standard just says that it's UB, allowing the compiler to salvage the situation however it wants, possibly by not even doing the subtraction and thus saving time. It could just optimize away the line, and the value would be whatever was in memory to begin with; the result would be just as useless. In general, leaving things undefined creates potential for optimization. – Blaze May 08 '19 at 08:26
  • What if the objects are in different memory segments? There's no meaningful "difference" then. – melpomene May 08 '19 at 08:34
  • @Blaze Assuming a linear memory layout, the resulting value isn't entirely nonsensical. I've seen code that actually relies on pointer arithmetic across separate arrays. For example, setting `d = p - q` and later assuming that `q + d` yields `p`. – nwellnhof May 08 '19 at 08:35
  • @Qubit this being UB would actually invalidate existing `offsetof` implementations, because they are based on subtracting a pointer to an object from a pointer to a member variable of that object. Moreover, `offsetof` often uses `nullptr` as the pointer to the object, which results in some old reflection-capable libraries being based on UB, although the new standard has proposals to include language-level reflection. – Swift - Friday Pie May 08 '19 at 08:39
  • @nwellnhof "Assuming a linear memory layout" You don't get to make that assumption. You've seen code that depends on a particular symptom of undefined behaviour remaining consistent. – Caleth May 08 '19 at 08:48
  • I think that the decision between undefined/implementation defined is somewhat arbitrary. For example, left-shifting a negative value is undefined in C++17 (but most likely it will be well-defined in the next standard). But there is no real reason that it is undefined. It could have been implementation-defined as well. The only reason I could think of is that the optimizer has more possibilities, if it is undefined. – geza May 08 '19 at 08:49
  • @αλεχολυτ what I meant is just that implementation-defined just means the implementation defines it, but that doesn't change the fact that it's not obvious how to define it, and AFAIK an implementation could just as well declare it undefined. – 463035818_is_not_an_ai May 08 '19 at 09:16
  • If this were to be implementation-defined, then there would not really be a reason to make anything undefined behavior ... – L. F. May 08 '19 at 10:05
  • Consider also that if `sizeof(T) > alignof(T)`, then if `P` and `Q` don't point to elements of the same array you can easily end up with a non-integer multiple of `sizeof(T)` bytes between the two pointers. That would prevent any consistent implementation that preserves the invariant `Q + (P - Q) == P`; see the sketch after these comments. – Joe Lee-Moyet May 08 '19 at 13:42
  • @L.F. some people would agree with such a sentiment – M.M May 08 '19 at 14:07
  • @JoeLee-Moyet makes a really good [practical] point! – Lightness Races in Orbit May 08 '19 at 14:10
  • When you consider registers, and in particular registers used for SIMD instructions, this restriction makes a lot of sense. If a variable gets compiled down to a range of bits in ymm3 or whatever, then it makes sense that you could also have a pointer to that variable and then dereference it. It *doesn't* make sense to subtract that pointer from a pointer to another variable stored in rax or rdx or whatever. – hegel5000 May 08 '19 at 14:28
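To illustrate Joe Lee-Moyet's point, here is a minimal sketch (mine, not from the thread; the struct layout and the observed distance are implementation-specific, and the remainder may well be zero on any particular run):

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct T { int a[3]; }; // typically sizeof(T) == 12 while alignof(T) == 4

int main() {
    // Two separately allocated objects only have to be aligned to
    // alignof(T), so the byte distance between them need not be a
    // multiple of sizeof(T) -- in which case no ptrdiff_t value d
    // could satisfy q + d == p.
    T x, y;
    std::uintptr_t d = reinterpret_cast<std::uintptr_t>(&y) -
                       reinterpret_cast<std::uintptr_t>(&x);
    std::printf("byte distance %% sizeof(T) = %zu\n",
                static_cast<std::size_t>(d % sizeof(T)));
}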

3 Answers


Speaking more academically: pointers are not numbers. They are pointers.

It is true that a pointer on your system is implemented as a numerical representation of an address, i.e. a location in some abstract kind of memory (probably a virtual, per-process memory space).

But C++ doesn't care about that. C++ wants you to think of pointers as post-its, as bookmarks, to specific objects. The numerical address values are just a side-effect. The only arithmetic that makes sense on a pointer is forwards and backwards through an array of objects; nothing else is philosophically meaningful.

This may seem pretty arcane and useless, but it's actually deliberate and useful. C++ doesn't want to constrain implementations by imbuing pointers with further meaning derived from practical, low-level machine properties that the language cannot control. And, since there is no reason to do so (why would you want to do this?), it just says that the result is undefined.

In practice you may find that your subtraction works. However, compilers are extremely complicated and make great use of the standard's rules in order to generate the fastest code possible; that can and often will result in your program appearing to do strange things when you break the rules. Don't be too surprised if your pointer arithmetic operation is mangled when the compiler assumes that both the originating value and the result refer to the same array — an assumption that you violated.
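To make the rule concrete, here is a minimal sketch (the array names are mine, not from the answer):

#include <cstddef>

int main() {
    int a[8];
    int b[8];

    std::ptrdiff_t ok = &a[5] - &a[2];  // well-defined: both point into a; the result is 3
    std::ptrdiff_t bad = &b[0] - &a[0]; // undefined: a and b are different arrays
    (void)ok;
    (void)bad;
}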

Lightness Races in Orbit

As noted by some in the comments, unless the resulting value has some meaning or is usable in some way, there is no point in making the behavior defined.

A study was done for the C language to answer questions related to pointer provenance (with the intention of proposing wording changes to the C specification), and one of the questions was:

Can one make a usable offset between two separately allocated objects by inter-object subtraction (using either pointer or integer arithmetic), to make a usable pointer to the second by adding the offset to the first? (source)

The conclusions of the study's authors were published in a paper titled Exploring C Semantics and Pointer Provenance, and with respect to this particular question, the answer was:

Inter-object pointer arithmetic The first example in this section relied on guessing (and then checking) the offset between two allocations. What if one instead calculates the offset, with pointer subtraction; should that let one move between objects, as below?

// pointer_offset_from_ptr_subtraction_global_xy.c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

int x=1, y=2;
int main() {
    int *p = &x;
    int *q = &y;
    ptrdiff_t offset = q - p;
    int *r = p + offset;
    if (memcmp(&r, &q, sizeof(r)) == 0) {
        *r = 11; // is this free of UB?
        printf("y=%d *q=%d *r=%d\n",y,*q,*r);
    }
}

In ISO C11, the q-p is UB (as a pointer subtraction between pointers to different objects, which in some abstract-machine executions are not one-past-related). In a variant semantics that allows construction of more-than-one-past pointers, one would have to choose whether the *r=11 access is UB or not. The basic provenance semantics will forbid it, because r will retain the provenance of the x allocation, but its address is not in bounds for that. This is probably the most desirable semantics: we have found very few example idioms that intentionally use inter-object pointer arithmetic, and the freedom that forbidding it gives to alias analysis and optimisation seems significant.

This study was picked up by the C++ community, summarized, and sent to WG21 (the C++ Standards Committee) for feedback.

Relevant point of the Summary:

Pointer difference is only defined for pointers with the same provenance and within the same array.

So, they have decided to keep it undefined for now.

Note that there is a study group, SG12, within the C++ Standards Committee for studying Undefined Behavior & Vulnerabilities. This group conducts a systematic review to catalog cases of vulnerabilities and undefined/unspecified behavior in the standard and recommends a coherent set of changes to define and/or specify the behavior. You can keep track of this group's proceedings to see if there are going to be any changes to behaviors that are currently undefined or unspecified.

P.W

First, see this question mentioned in the comments for why it isn't well defined. The concise answer given there is that arbitrary pointer arithmetic is not possible in the segmented memory models used by some (now archaic?) systems.

What is the rationale for making such behavior undefined instead of, for instance, implementation-defined?

Whenever the standard specifies something as undefined behaviour, it could usually have been specified as implementation-defined instead. So, why specify anything as undefined?

Well, undefined behaviour is more lenient. In particular, because a compiler is allowed to assume that there is no undefined behaviour, it may perform optimisations that would break the program if that assumption weren't correct. So, a reason to specify undefined behaviour is optimisation.

Let's consider a function fun(int* arr1, int* arr2) that takes two pointers as arguments. Those pointers may or may not point into the same array. Let's say the function iterates through one of the pointed-to arrays (arr1 + n) and must compare each position to the other pointer for equality ((arr1 + n) != arr2) on each iteration, for example to ensure that the pointed-to object is not overwritten.

Let's say that we call the function like this: fun(array1, array2). The compiler knows that (array1 + n) != array2, because otherwise the behaviour would be undefined. Therefore, if the function call is expanded inline, the compiler can remove the redundant check (arr1 + n) != arr2, which is always true. If pointer arithmetic across array boundaries were well defined (or even implementation-defined), then (array1 + n) == array2 could be true for some n, and this optimisation would be impossible - unless the compiler could prove that (array1 + n) != array2 holds for all possible values of n, which can sometimes be much harder to prove.
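A sketch of that scenario (the function body and call site are hypothetical illustrations, and whether a particular compiler actually performs the transformation will vary):

// fun() writes through arr1 but must skip the element arr2 points to.
void fun(int* arr1, int* arr2, int n) {
    for (int i = 0; i < n; ++i) {
        if (arr1 + i != arr2)  // the check the optimiser may remove
            arr1[i] = 0;
    }
}

int array1[16];
int array2[16];

void caller() {
    // After inlining, the compiler may reason that arr1 + i can never
    // equal arr2 here (i stays within array1, and array2 is a separate
    // object), drop the comparison, and emit a plain loop.
    fun(array1, array2, 16);
}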


Pointer arithmetic across the members of a class could be implemented even in segmented memory models, and the same goes for iterating across the boundaries of a subarray. There are use cases where these could be quite useful, but they are technically UB.
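For example, the classic "container of" idiom recovers an enclosing object from a pointer to one of its members; it appears in practice (intrusive containers, callback APIs) even though the subtraction it performs is not within any array. A sketch with hypothetical names:

#include <cstddef>

struct Node {
    int header;
    int payload;
};

// Recover the enclosing Node from a pointer to its payload member.
// The subtraction below operates on a pointer that does not point
// into an array of char, so it is technically UB under the rules
// quoted in the question, even though it works on common platforms.
Node* node_of(int* payload_ptr) {
    return reinterpret_cast<Node*>(
        reinterpret_cast<char*>(payload_ptr) - offsetof(Node, payload));
}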

An argument for UB in these cases is that it leaves more possibilities for optimisation. You don't necessarily need to agree that this is a sufficient argument.

eerorika
  • Ah, I'm confusing the rules for ordering pointers. `==` and `!=` are well defined for pointers to objects of the same type (or `void *`) – Caleth May 08 '19 at 10:55
  • @Caleth Cool. That's what I remembered :) The relational operators aren't themselves UB either (at least in latest draft). It's just that the order is unspecified, so they don't impose a strict ordering, which may lead to violation of some preconditions. – eerorika May 08 '19 at 10:59