35

The following example is from Wikipedia.

int arr[4] = {0, 1, 2, 3};
int* p = arr + 5;  // undefined behavior

If I never dereference p, then why is arr + 5 alone undefined behaviour? I expect pointers to behave like integers, with the exception that when dereferenced the value of a pointer is treated as a memory address.

svick
NFRCR
  • I am fairly sure the "undefined" part is just the standard saying that it cannot tell you where that pointer is pointing now. Like most pointer "undefined" things, I am sure it is fine to make it, but illegal to dereference it. –  May 06 '12 at 19:35
  • 3
    @EthanSteinberg: That'd only be true if they said the resulting *value* was undefined. If the *behavior* is undefined, it's not safe to execute it, even if you never dereference it. – user541686 May 06 '12 at 19:36
  • 2
    Pointers are *not* integers. Under the hood, the representation may coincide, but as far as the "C++ abstract machine" is concerned, those are entirely different things that happen to share some syntax, like `struct { int a; int x; }` and `struct { char x; }`. –  May 06 '12 at 19:37
  • possible duplicate of [C++ Accesses an Array out of bounds gives no error, why?](http://stackoverflow.com/questions/1239938/c-accesses-an-array-out-of-bounds-gives-no-error-why) Not for the question as much as the top answer to this question. – Tony May 06 '12 at 19:40
  • 3
    Because not all machines behave the same way as your PC. You are expecting a certain behavior based on how it works on your machine. The standards committee has more experience and understands that other architectures implement pointers differently, and thus cannot guarantee the above behavior across all platforms (hence it is undefined). – Martin York May 06 '12 at 19:48
  • 3
    I found a situation in which this undefined behavior actually makes the calculation wrong (on a normal x86): http://stackoverflow.com/questions/23683029/is-gccs-option-o2-breaking-this-small-program-or-do-i-have-undefined-behavior – Bernd Elkemann Jan 28 '15 at 20:14

7 Answers

30

That's because pointers don't behave like integers. It's undefined behavior because the standard says so.

On most platforms however (if not all), you won't get a crash or run into dubious behavior if you don't dereference the pointer. But then, if you don't dereference it, what's the point of doing the addition?

That said, note that an expression going one past the end of an array is technically 100% "correct" and guaranteed not to crash per §5.7 ¶5 of the C++11 spec. However, the result of that expression is unspecified (just guaranteed not to be an overflow), whereas any expression going more than one past the array bounds is explicitly undefined behavior.

Note: That does not mean it is safe to read or write through an over-by-one pointer. You would likely be touching data that does not belong to that array, which will cause state/memory corruption. You just won't cause an overflow exception.

My guess is that it's like that because it's not only dereferencing that's wrong. Pointer arithmetic, pointer comparisons, and so on can also go wrong, so it's just easier to say "don't do this" than to enumerate all the situations where it can be dangerous.
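To make the distinction concrete, here is a minimal sketch (my own illustration of the rule, not part of the original answer):

int main() {
    int arr[4] = {0, 1, 2, 3};

    int* end = arr + 4;  // fine: one past the last element (§5.7 ¶5)
    // *end;             // undefined behavior: a one-past-the-end pointer
                         // may be formed but never dereferenced

    // The classic use: 'end' serves as a sentinel, just like std::end(arr).
    int sum = 0;
    for (int* p = arr; p != end; ++p)
        sum += *p;

    // int* bad = arr + 5;  // undefined behavior, even with no dereference

    return sum == 6 ? 0 : 1;
}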

Mahmoud Al-Qudsi
Luchian Grigore
  • How is going one over the bounds fine? – Mahmoud Al-Qudsi May 06 '12 at 19:39
  • 4
    @MahmoudAl-Qudsi The standard says it's fine, that's how. –  May 06 '12 at 19:40
  • @MahmoudAl-Qudsi it just is :) There's a lot of stuff that depends on it. Like std iterators. – Luchian Grigore May 06 '12 at 19:40
  • Can you clarify your post? Per §5.7 ¶5 of the C++11 spec, it's only the *expression* that is valid (i.e. guaranteed not to overflow/no exceptions), but not the result. I know you didn't say the *result* was defined, but it could easily be misconstrued as such. *"If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."* – Mahmoud Al-Qudsi May 06 '12 at 19:49
  • @LuchianGrigore if you don't mind, I've edited your post. If you do mind, I can revert and make a separate answer. – Mahmoud Al-Qudsi May 06 '12 at 20:03
  • @MahmoudAl-Qudsi that's fine. I didn't look for the exact quote in the standard because I knew this from previous SO questions :) – Luchian Grigore May 30 '12 at 15:01
  • 3
    The result of incrementing a pointer to an array's last element is not unspecified; it is specified to be a pointer just past the array's last element; subtracting from such a pointer a value from 1 to the size of the array will yield a valid pointer to an array element. – supercat Jul 09 '14 at 17:44
  • What about the address one element before the first, out of bounds? I want the same trick as one-past-the-end for iterating over an array in reverse order. – Levi Morrison Jun 30 '18 at 00:29
23

The original x86 can have issues with such statements. In 16-bit code, pointers are 16+16 bits (segment:offset). If you add an offset to the lower 16 bits, you might need to deal with overflow and change the upper 16 bits. That was a slow operation, best avoided.

On those systems, array_base+offset was guaranteed not to overflow if offset was in range (<= array size). But arr + 5 could overflow, since arr contains only 4 elements.

The consequence of that overflow is that you get a pointer which doesn't point past the array, but before it. And that might not even be RAM; it might be memory-mapped hardware. The C++ standard doesn't try to limit what happens if you construct pointers to random hardware components, i.e. it's undefined behavior on real systems.
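To make the overflow concrete, here is a sketch of real-mode segment:offset arithmetic (the function and the values are my own illustration, assuming the common scheme where only the 16-bit offset participates in pointer addition):

#include <cstdint>
#include <cstdio>

// Real-mode pointer addition, sketched: the segment part stays untouched,
// and the 16-bit offset simply wraps around on overflow.
std::uint32_t linear_address(std::uint16_t segment, std::uint16_t offset,
                             std::uint16_t bytes_added) {
    std::uint16_t new_offset =
        static_cast<std::uint16_t>(offset + bytes_added);  // wraps mod 2^16
    return (static_cast<std::uint32_t>(segment) << 4) + new_offset;
}

int main() {
    // An int at offset 0xFFFC: adding sizeof(int) wraps the offset to 0,
    // so the "next element" lands at the start of the segment, i.e.
    // *before* the array rather than after it.
    unsigned long addr = linear_address(0x1000, 0xFFFC, 4);
    std::printf("0x%lx\n", addr);  // prints 0x10000, not 0x20000
}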

MSalters
5

If arr happens to be right at the end of the machine's memory space, then arr+5 might be outside that memory space, so the pointer type might not be able to represent the value, i.e. it might overflow, and overflow is undefined.
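A sketch of the arithmetic (the addresses are made up for illustration, assuming a flat 32-bit address space):

#include <cstdint>
#include <cstdio>

int main() {
    // Imagine the 4-int array sits at the very top of a 32-bit address
    // space, occupying 0xFFFFFFF0 through 0xFFFFFFFF.
    std::uint32_t arr_address = 0xFFFFFFF0u;

    // "arr + 5" done as raw integer math: the result wraps modulo 2^32,
    // so the pointer value isn't even representable as a higher address.
    std::uint32_t five_past =
        static_cast<std::uint32_t>(arr_address + 5u * sizeof(int));

    std::printf("0x%x\n", five_past);  // 0x4: wrapped around past zero
}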

Jonathan Wakely
5

"Undefined behavior" doesn't mean it has to crash on that line of code, but it does mean that you can't make any guaranteed about the result. For example:

#include <cassert>

int main() {
    int arr[4] = {0, 1, 2, 3};
    int* p = arr + 5; // I guess this is allowed to crash, but that would be a rather
                      // unusual implementation choice on most machines.

    *p; // may cause a crash, or it may read data out of some other data structure
    assert(arr < p); // this statement may not be true
                     // (arr may be so close to the end of the address space that
                     //  adding 5 overflowed the address space and wrapped around)
    assert(p - arr == 5); // this statement may not be true;
                          // the compiler may have assigned p some other value
}

I'm sure there are many other examples you can throw in here.

Ken Bloom
  • 1
    `arr+5` is not one-past-the-end it's two-past-the-end, therefore it's UB according to §5.7 ¶5 and it could crash on a machine that has trap representations for pointers. – Jonathan Wakely May 06 '12 at 20:01
  • 1
    That was in reply to a comment which has been deleted, please ignore the "not one-past-the-end" part. The rest still applies, it _could_ crash, but I agree that would be unusual. – Jonathan Wakely May 06 '12 at 20:07
2

Some systems (very rare ones, and I can't name one) will trap when you increment a pointer past boundaries like that. Further, the rule allows implementations that provide bounds protection to exist, though again I can't think of one.

Essentially, you shouldn't be doing it, and therefore there's no reason to specify what happens when you do. Specifying what happens would put an unwarranted burden on the implementation provider.
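As a rough illustration of what such a bounds-protecting implementation could do, here is a sketch of a "fat pointer" that traps on out-of-range arithmetic (the class and its names are mine, purely hypothetical):

#include <cassert>
#include <cstddef>

// A pointer that carries its array's bounds and traps (here: asserts)
// as soon as arithmetic produces an out-of-range address, before any
// dereference happens.
struct CheckedPtr {
    int* base;
    std::size_t size;
    std::size_t index;  // positions 0..size allowed (one past the end is OK)

    CheckedPtr operator+(std::ptrdiff_t n) const {
        std::ptrdiff_t i = static_cast<std::ptrdiff_t>(index) + n;
        assert(i >= 0 && static_cast<std::size_t>(i) <= size);  // trap on formation
        return {base, size, static_cast<std::size_t>(i)};
    }
    int& operator*() const {
        assert(index < size);  // one-past-the-end may be formed, not dereferenced
        return base[index];
    }
};

int main() {
    int arr[4] = {0, 1, 2, 3};
    CheckedPtr p{arr, 4, 0};
    CheckedPtr end = p + 4;  // fine: one past the end
    (void)end;
    // CheckedPtr bad = p + 5;  // would assert here, with no dereference in sight
}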

Edward Strange
  • 4
    Systems that *can* do that are actually quite common, with Intel x86 (and compatibles) as a prime example. It's not usually *used* that way, but the x86's segment-based memory protection works as described -- it can throw an exception for even attempting to form an invalid address. Most typical OSes, however, set up all segments with a base of 0 and a limit of 4Gig, making all possible offsets valid. For what it's worth, this capability was actually used in OS/2 1.x. – Jerry Coffin May 06 '12 at 23:07
  • @JerryCoffin: I wish Intel had used 32-bit segment registers on the 80386, with the upper portion selecting a segment descriptor and the lower portion acting as a scaled multiplier whose behavior would be controlled by that segment descriptor. Such an architecture would have made it possible to use 32-bit object references without a four gigabyte addressing limit (the number of distinct objects would be limited to well under four billion, but their total size could be much greater). – supercat Jul 09 '14 at 17:47
0

In addition to hardware issues, another factor was the emergence of implementations that attempted to trap various kinds of programming errors. Although many such implementations could be most useful if configured to trap on constructs which a program is known not to use, even when those constructs are defined by the C Standard, the authors of the Standard did not want to define the behavior of constructs which, in many programming fields, would be symptomatic of errors.

In many cases it is much easier to trap on actions that use pointer arithmetic to compute the addresses of unintended objects than to somehow record that certain pointers cannot be used to access the storage they identify, but could be modified so that they could access other storage. Except in the case of arrays within larger (two-dimensional) arrays, an implementation would be allowed to reserve space "just past" the end of every object. Given something like doSomethingWithItem(someArray+i);, an implementation could then trap any attempt to pass an address which doesn't point either to an element of the array or to the space just past the last element. If the allocation of someArray reserved space for an extra unused element, and doSomethingWithItem() only accesses the item to which it receives a pointer, the implementation could relatively inexpensively ensure that any non-trapped execution of the above code would, at worst, access otherwise-unused storage.

The ability to compute "just-past" addresses makes bounds checking more difficult than it otherwise would be (the most common erroneous situation would be passing doSomethingWithItem() a pointer just past the end of the array, yet behavior would be defined unless doSomethingWithItem tried to dereference that pointer, something the caller may be unable to prove). Because the Standard allows compilers to reserve space just past the array in most cases, however, that allowance lets implementations limit the damage caused by untrapped errors, something that would likely not be practical if more generalized pointer arithmetic were allowed.
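A sketch of the "reserve one extra element" idea described above (the allocation scheme and names are hypothetical, not from any real implementation):

#include <cstdio>
#include <cstdlib>

// Reserve one unused slot past every array, so that an erroneous access
// through a "just past the end" pointer lands in storage the program owns
// but never otherwise uses, rather than in an unrelated object.
int* alloc_int_array(std::size_t n) {
    return static_cast<int*>(std::malloc((n + 1) * sizeof(int)));  // n + 1 slots
}

void doSomethingWithItem(int* item) {
    std::printf("%d\n", *item);  // only touches the element it was handed
}

int main() {
    std::size_t n = 4;
    int* someArray = alloc_int_array(n);
    for (std::size_t i = 0; i < n; ++i)
        someArray[i] = static_cast<int>(i);

    // An off-by-one loop bound (i <= n) would pass someArray + n once too
    // often; with the padding slot, the worst case touches unused storage.
    for (std::size_t i = 0; i < n; ++i)
        doSomethingWithItem(someArray + i);

    std::free(someArray);
}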

supercat
-1

The result you are seeing is because of the x86's segment-based memory protection. I find this protection justified: when you increment the pointer address and store it, it means that at some future point in your code you will be dereferencing the pointer and using the value. The compiler wants to avoid situations where you end up changing somebody else's memory location or freeing memory that is owned by some other part of your code. To avoid such scenarios, the compiler puts this restriction in place.

Helping Bean