6

According to the C standard:

When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object (sect. 6.5.6 1173)

[Note: do not assume that I know much of the standard or UB; I just happen to have come across this one.]

  1. I understand that in almost all cases, taking the difference of pointers in two different arrays would be a bad idea anyway.
  2. I also know that on some architectures ("segmented machine" as I read somewhere), there are good reasons that the behavior is undefined.

Now on the other hand

  1. It may be useful in some corner cases. For example, in this post, it would allow using a library interface with several different arrays, instead of copying everything into one array that will be split up again just after.
  2. It seems that on "ordinary" architectures, the way of thinking "all objects are stored in a big array starting at approx. 0 and ending at approx. memory size" is a reasonable description of the memory. When you actually do look at pointer differences of different arrays, you get sensible results.

Hence my question: from experiment, it seems that on some architectures (e.g. x86-64), pointer difference between two arrays provides sensible, reproducible results. And it seems to correspond reasonably well to the hardware of these architectures. So does some implementation actually guarantee a specific behavior?

For example, is there an implementation out there in the wild that guarantees that, for `a` and `b` being `char*`, we have `a + (reinterpret_cast<std::ptrdiff_t>(b) - reinterpret_cast<std::ptrdiff_t>(a)) == b`?
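
Concretely, here is a small sketch of the kind of check I have in mind (the variable names are mine, I use `std::intptr_t` for the casts, and nothing in it is guaranteed by either standard; it merely seems to hold in practice on flat-memory targets):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

int main() {
    char x[8];
    char y[8];
    char* a = x;
    char* b = y;

    // "Distance" computed on the integer representations of the addresses,
    // not by subtracting the pointers themselves.
    std::ptrdiff_t d = reinterpret_cast<std::intptr_t>(b) -
                       reinterpret_cast<std::intptr_t>(a);

    // a + d steps outside the array a points into, so the standard still
    // calls this undefined; on a flat address space such as x86-64 it
    // typically prints "true".
    std::cout << std::boolalpha << (a + d == b) << '\n';
}
```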

Bérenger
  • 2,678
  • 2
  • 21
  • 42
  • Well, the 'gap' between different elements of a *single* array is well-defined. However, who's to say that the 'gap' between two different arrays will even be fixed between different invocations of the same program (if they're stored on the stack), or between different *builds* of the same program (if they're not). – Adrian Mole Sep 15 '20 at 16:01
  • 2
    "where of course, for some architectures, implementation-defined will specify it as UB" I am not quite sure they can do that. Implementation-defined code is valid. – Quimby Sep 15 '20 at 16:01
  • 3
    "where of course, for some architectures, implementation-defined will specify it as UB" - it's the other way round. Standard can specify something as UB, but implementation can say it's defined. Implementation cannot reject valid (and implementation-defined is valid) code, because it would be non-conformant. – Yksisarvinen Sep 15 '20 at 16:03
  • A pointer difference is the byte difference divided by the byte size of an element. If the pointers are to two different element types, which size should be used? – Mark Ransom Sep 15 '20 at 16:03
  • There is no meaning that can be obtained from subtracting pointers from elements that don't belong to the same array. What do you expect `&(a[0]) - &(b[0])` to represent for two different arrays `a` and `b`? Whatever the definition you come up with, if it is meaningful you can probably come up with that information otherwise without pointer arithmetic. – François Andrieux Sep 15 '20 at 16:03
  • 1
    @Quimby I also think so. And I think there is no mechanism in the standard to handle this type of case, because then it would mean it is no longer portable – Bérenger Sep 15 '20 at 16:04
  • @MarkRansom I agree but the standard also prevents it if the types are the same – Bérenger Sep 15 '20 at 16:08
  • @FrançoisAndrieux See the linked post for an example where you would need this information. The idea is to say: I have an array starting at `&(b[0])` and the second part of the information is at distance `&(a[0]) - &(b[0])` from it – Bérenger Sep 15 '20 at 16:10
  • 1
    Implementation defined means that there is a choice that the implementation must make, and document. For example, the allowed representations of an `int` type are two's complement, one's complement, or sign magnitude. Those are the only allowed representations, and every implementation must choose one, and document the choice. With UB, there are no limits on what the implementation can do, and the implementation doesn't need to document what it does. – user3386109 Sep 15 '20 at 16:13
  • You have the tag C and C++, why? Please ask only for one specific language except when you ask about differences, similarities and so on. – 12431234123412341234123 Sep 15 '20 at 16:27
  • From the comments in the linked question, it sounds like you want to treat the whole stack as a large array and provide offsets where your data is located. Is this correct? – François Andrieux Sep 15 '20 at 16:29
  • @12431234123412341234123 In case the C and C++ standards or implementations would disagree – Bérenger Sep 15 '20 at 16:35
  • *"is there an implementation out there in the wild that guarantees"* This is doubtful because `reinterpret_cast` does not say that anything is implementation defined except for the conversion from function pointer to `void*`. So implementation are not required to document what exactly the operation does or how it works. Since they are not requirement to, they likely won't document it. – François Andrieux Sep 15 '20 at 16:35
  • 2
    Hopefully research of `far` pointers will answer this question – Mooing Duck Sep 15 '20 at 16:40
  • @FrançoisAndrieux (Regarding the previous comment) No, not exactly. Let's say I have 3 arrays. The interface only wants one pointer (e.g. to the first array) and then offsets to the other arrays. So you can compute these offsets by pointer differences between arrays [Your comment should be on the other question I think]. – Bérenger Sep 15 '20 at 16:41
  • Virtual Memory. Memory Paging. There is no guarantee that two arrays will be in the same memory area. For example, one array could be swapped out onto the hard drive while another is in physical memory. Also read up on "Memory mapped files". In embedded systems, the memory could be in physically different places. For example, one array could be on the System On a Chip (SOC) and the other array in RAM (or on any other device that has memory). In all these cases, subtracting of these pointers makes no sense or is not very useful. – Thomas Matthews Sep 15 '20 at 17:23
  • @Bérenger Then ask 2 questions, one for C and one for C++. C and C++ are different languages. Especially for such a question, one has `reinterpret_cast` one has not. – 12431234123412341234123 Sep 15 '20 at 18:06

5 Answers

6

Why make it UB, and not implementation-defined? (where of course, for some architectures, implementation-defined will specify it as UB)

That is not how it works.

If something is documented as "implementation-defined" by the standard, then any conforming implementation is expected to define a behavior for that case, and document it. Leaving it undefined is not an option.

As labeling pointer difference between unrelated arrays "implementation defined" would leave e.g. segmented or Harvard architectures with no way to have a fully-conforming implementation, this case remains undefined by the standard.

Implementations could offer a defined behavior as a non-standard extension. But any program making use of such an extension would no longer be strictly conforming, and non-portable.

DevSolar
  • 67,862
  • 21
  • 134
  • 209
  • Well in theory, implementation defined can be as bad as UB. If it's not specified further than just "implementation defined", the implementation can define the result to be random. – klutt Sep 15 '20 at 16:39
  • @klutt Random results is still orders of magnitude better than Undefined Behavior. – François Andrieux Sep 15 '20 at 16:39
  • @FrançoisAndrieux Indeed. Invoking Undefined Behavior technically invalidates the whole program. – Christian Gibbons Sep 15 '20 at 16:40
  • @FrançoisAndrieux Well, that depends. Does not have to be. – klutt Sep 15 '20 at 16:41
  • For instance, when it comes to the evaluation order of function arguments, there's not too much that can go wrong, because the only thing that is implementation defined is the *order* of evaluation, and not the entire *behavior*. And I know I'm taking it a bit far here. In practice, this answer is completely right. – klutt Sep 15 '20 at 16:48
  • @klutt Relying on function argument order is Unspecified Behavior which is categorically less dangerous than Undefined Behavior. Undefined Behavior automatically invalidates any program that encounters it, Unspecified Behavior means there exists multiple possible results. I don't see how that example shows that Unspecified Behavior can approach the harm of Undefined Behavior. – François Andrieux Sep 15 '20 at 16:53
  • 2
    ... and [Harvard architectures](https://en.wikipedia.org/wiki/Harvard_architecture). When one array is in code memory, and the other in data memory, the difference between the two array addresses is meaningless. – user3386109 Sep 15 '20 at 16:59
  • @FrançoisAndrieux My point is that it's the order that is unspecified in that case. Not the entire behavior. If the standard said that "the behavior of subtraction between two pointers is implementation defined", then the implementation can do whatever it wants as long as it's documented. If it instead said "the *result* of subtraction between two pointers is implementation defined" it would be a different story. I'm just saying that just because something is implementation defined does not *guarantee* anything if not explicitly stated. – klutt Sep 15 '20 at 17:16
5

Any implementation is free to document a behaviour for which the standard does not require one to be documented - that is well within the limits of the standard. The problem with making this case implementation-defined is that every implementation would then have to choose and carefully document a behaviour, and when C was standardized the committee presumably found that existing implementations varied so wildly that no sensible common ground existed, so they decided to make it UB altogether.


I do not know of any compiler that makes it defined, but I do know of a compiler that explicitly keeps it undefined, even if you try to cheat with casts:

When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.
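
To make the quoted rule concrete, here is a hedged sketch (the arrays and the function are invented for illustration) of the kind of integer detour it says does not help:

```cpp
#include <cstdint>

char first[16];
char second[16];

// Hypothetical example of trying to reach `second` from the address of
// `first` through plain integer arithmetic instead of pointer subtraction.
char* integer_detour() {
    char* a = first;
    char* b = second;
    std::uintptr_t from = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t to   = reinterpret_cast<std::uintptr_t>(b);
    std::uintptr_t off  = to - from;  // ordinary unsigned arithmetic, fine so far
    // Converting the sum back to a pointer does not rescue the code: per the
    // documentation quoted above, the casts do not make the cross-array step
    // well-defined, so the behaviour remains undefined.
    return reinterpret_cast<char*>(from + off);
}
```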

I believe another compiler has the same behaviour as well, though unfortunately it doesn't document it in an accessible way.

That those two compilers do not define it is a good reason to avoid depending on it in any program, even one compiled with another compiler that does specify a behaviour, because you can never be too sure which compiler you will need to use 5 years from now...

4

The more implementation-defined behavior there is for someone's code to depend on, the less portable that code is. In this case, there's already an implementation-defined way out: reinterpret_cast the pointers to integers and do your math there. That makes it clear to everyone that you're relying on behavior specific to the implementation (or at least, behavior that may not be portable everywhere).
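
As a hedged sketch of that route (the library interface below is invented for illustration, and the offsets are only meaningful on an implementation with a flat address space such as x86-64):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

// Stub standing in for a library interface that wants one base pointer plus
// signed integer offsets to the other buffers.
void library_call(char* base, std::ptrdiff_t off1, std::ptrdiff_t off2) {
    std::cout << "offsets: " << off1 << ", " << off2 << '\n';
}

int main() {
    static char b0[32], b1[32], b2[32];

    // Doing the subtraction on the integer representations of the addresses
    // keeps the pointer-arithmetic rules out of the picture, but the values
    // are implementation-specific rather than portable.
    std::intptr_t base = reinterpret_cast<std::intptr_t>(&b0[0]);
    library_call(b0,
                 reinterpret_cast<std::intptr_t>(&b1[0]) - base,
                 reinterpret_cast<std::intptr_t>(&b2[0]) - base);
}
```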

Plus, while the runtime environment may in fact be "all objects are stored in a big array starting at approx. 0 and ending at approx. memory size," that is not true of the compile-time behavior. At compile-time, you can get pointers to objects and do pointer arithmetic on them. But treating such pointers as just addresses into memory could allow a user to start indexing into compiler data and such. Making such things UB expressly forbids them at compile-time (and reinterpret_cast is explicitly disallowed at compile-time).

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
  • Ok I think this is indeed the right way to see it. But then the next question would be: what do typical implementations say about the result of this kind of `reinterpret_cast`? For example, on x86-64, with GCC, is `&a + (reinterpret_cast(&b)- reinterpret_cast(&a))` supposed to be equal to `&b`? – Bérenger Sep 15 '20 at 16:18
  • @Bérenger I would use `uintptr_t` or `intptr_t` for my casting rather than `ptrdiff_t` – Christian Gibbons Sep 15 '20 at 16:22
  • Also, `&a` is a pointer, so you are doing pointer arithmetic rather than standard integer arithmetic when adding the result of your subtraction to `&a`. – Christian Gibbons Sep 15 '20 at 16:26
  • @Bérenger It basically only guarantees that the integer you get can be converted back to get the original pointer, and that two pointers don't convert to the same integer. It can't be relied on to find the distance in bytes between arrays. There exists no portable mechanism in C++ that can do that unless the arrays are part of the same higher dimension array. – François Andrieux Sep 15 '20 at 16:26
  • @ChristianGibbons Yes the comment lacks a sizeof :D. See the OP edit – Bérenger Sep 15 '20 at 16:32
1

One big reason for making things UB is to allow the compiler to perform optimizations. If you want to allow such a thing, you give up some of those optimizations. And as you say, this is only useful (if even then) in some small corner cases. I would say that in most cases where this might seem like a viable option, you should instead reconsider your design.

From comments below:

I agree, but the problem is that while I can reconsider my design, I can't reconsider the design of other libraries...

It is very rare that the standard adapts to such things. It has happened, however. That's the reason why `int *p = 0` is perfectly valid, even though `p` is a pointer and `0` is an `int`. It made it into the standard because it was so commonly used instead of the more correct `int *p = NULL`. But in general, this does not happen, and for good reasons.
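
A tiny sketch of that special case:

```cpp
#include <cstddef>

int main() {
    int *p = 0;        // valid: the integer literal 0 is a null pointer constant
    int *q = NULL;     // also valid; NULL states the intent more explicitly
    int *r = nullptr;  // C++11 added a dedicated keyword for the same purpose
    return (p == q && q == r) ? 0 : 1;
}
```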

klutt
  • 30,332
  • 17
  • 55
  • 95
1

First, I feel like we need to get some terms straight, at least with respect to C.

From the C2011 online draft:

  • Undefined behavior - behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements. Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

  • Unspecified behavior - use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance. An example of unspecified behavior is the order in which the arguments to a function are evaluated.

  • Implementation-defined behavior - unspecified behavior where each implementation documents how the choice is made. An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.

The key point above is that unspecified behavior means that the language definition provides multiple values or behaviors from which the implementation may choose, and there are no further requirements on how that choice is made. Unspecified behavior becomes implementation-defined behavior when the implementation documents how it makes that choice.

This means that there are restrictions on what may be considered implementation-defined behavior.
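
The standard's own shift example (from the last definition above) can be made concrete; in this small sketch the printed value depends on the choice the implementation documents:

```cpp
#include <iostream>

int main() {
    int x = -8;
    // In C11 (and in C++ before C++20), the result of right-shifting a
    // negative signed value is implementation-defined: the implementation
    // must pick a behaviour and document it. The common "arithmetic shift"
    // choice prints -2; a "logical shift" choice would print a large
    // positive value instead.
    std::cout << (x >> 2) << '\n';
}
```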

The other key point is that undefined does not mean illegal, it only means unpredictable. It means you've voided the warranty, and anything that happens afterwards is not the responsibility of the compiler implementation. One possible outcome of undefined behavior is to work exactly as expected with no nasty side effects. Which, frankly, is the worst possible outcome, because it means as soon as something in the code or environment changes, everything could blow up and you have no idea why (been in that movie a few times).

Now to the question at hand:

I also know that on some architectures ("segmented machine" as I read somewhere), there are good reasons that the behavior is undefined.

And that's why it's undefined everywhere. There are some architectures still in use where different objects can be stored in different memory segments, and any differences in their addresses would be meaningless. There are just so many different memory models and addressing schemes that you cannot hope to define a behavior that works consistently for all of them (or the definition would be so complicated that it would be difficult to implement).

The philosophy behind C is to be maximally portable to as many architectures as possible, and to do that it imposes as few requirements on the implementation as possible. This is why the standard arithmetic types (int, float, etc.) are defined by the minimum range of values that they can represent with a minimum precision, not by the number of bits they take up. It's why pointers to different types may have different sizes and alignments.
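
A small illustration of that point; the program below prints whatever the implementation chose, while the standard only pins down the minimums (an int must be able to represent at least -32767..32767):

```cpp
#include <climits>
#include <iostream>

int main() {
    // The standard guarantees minimum ranges, not exact widths: how many bits
    // an int occupies is left to the implementation.
    std::cout << "int is " << sizeof(int) * CHAR_BIT << " bits wide, range "
              << INT_MIN << " .. " << INT_MAX << '\n';
}
```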

Adding language that would make some behaviors undefined on this list of architectures vs. unspecified on that list of architectures would be a headache, both for the standards committee and various compiler implementors. It would mean adding a lot of special-case logic to compilers like gcc, which could make it less reliable as a compiler.

John Bode
  • 119,563
  • 19
  • 122
  • 198