25

Obviously, dereferencing an invalid pointer causes undefined behavior. But what about simply storing an invalid memory address in a pointer variable?

Consider the following code:

const char* str = "abcdef";
const char* begin = str;
if (begin - 1 < str) { /* ... do something ... */ }

The expression begin - 1 evaluates to an invalid memory address. Note that we don't actually dereference this address - we simply use it in a pointer comparison. Nonetheless, we still have to load an invalid memory address into a register.

So, is this undefined behavior? I never thought it was, since a lot of pointer arithmetic seems to rely on this sort of thing, and a pointer is really nothing but an integer anyway. But recently I heard that even the act of loading an invalid pointer into a register is undefined behavior, since certain architectures will automatically throw a bus error or something if you do that. Can anyone point me to the relevant part of the C or C++ standard which settles this either way?

Channel72
  • 253
  • 3
  • 6
  • 1
According to the C/C++ standard it is indeed undefined behavior. But, speaking frankly, I've never seen a real-world CPU/architecture on which the above actually misbehaves, i.e. machines that don't permit arbitrary pointer arithmetic. And I've seen quite a lot of architectures, including embedded microcontrollers. So, in my (humble) opinion, the code is OK as long as you restrict yourself to modern, non-esoteric architectures. – valdo Sep 21 '16 at 04:16
Can you please extend the question - what about a for loop that traverses the array backwards? In that traversal you will definitely need a pointer to the element before the first one, without dereferencing it. I had a similar question, but it was for the element after the last one. – Nick Sep 21 '16 at 04:37

7 Answers

15

I have the C Draft Standard here, and it makes it undefined by omission. It defines the result of ptr + I at 6.5.6/8 only for the following cases:

  • If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression.
  • Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object.

Your case fits neither of these. Your array is not large enough for -1 to adjust the pointer to a different array element, and neither the original pointer nor the result points one past the end.
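
To make those two bullets concrete, here is a minimal sketch (the variable names are purely illustrative, not from the question):

const char str[] = "abcdef";      // an array of 7 chars: 'a'..'f' plus '\0'
const char* p = str + 3;          // OK: points to an element of the array
const char* one_past = str + 7;   // OK: points one past the last element
const char* last = one_past - 1;  // OK: (Q)-1 where Q points one past the end
const char* before = str - 1;     // covered by neither bullet: undefined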

Johannes Schaub - litb
  • 496,577
  • 130
  • 894
  • 1,212
  • 1
Is this undefined or unspecified behavior? I would expect the code to run and work with no bad consequences, though whether it entered the if branch would be unknowable (via the standard). – Martin York Oct 01 '10 at 13:08
  • 1
@Martin York: The C++ standard defines this to be undefined behavior even if the pointer is not dereferenced. I hope I have picked out the relevant quote in my post. – Chubsdad Oct 01 '10 at 13:11
  • 1
    It is behavior which could cause a hardware fault on hardware which validates the contents of pointer registers. As such, it is Undefined Behavior. It is possible and permissible for a particular implementation to specify what will happen if programs do various things that, per the standard, evoke Undefined Behavior. If an implementation conforms to its own spec, the behavior will then be well-defined. If the code is run on a different implementation which conforms to the C standard, but not to that particular implementation's specs, however, the program may fail in arbitrary ways. – supercat Oct 01 '10 at 15:52
@supercat is correct: on some CPUs, loading an invalid pointer into a register will by itself crash the program, so guaranteeing that it will work would disable a lot of optimizations. – Davislor Sep 21 '16 at 04:17
@Lorehead: A guarantee that certain actions will have no side-effects on the target platform will allow a compiler to optimize out operations that would otherwise be needed to prevent them; if a compiler passes any such guarantees it receives through to the programmer, the programmer may be able to optimize out additional operations the compiler couldn't. Passing such guarantees through to the programmer would require that the compiler refrain from certain optimizations, but in cases where the programmer would have been able to exploit the guarantees, their value will likely exceed... – supercat Sep 21 '16 at 14:20
  • ...the value of "optimizations" that rely upon their absence. If e.g. a programmer knows that a particular action would be harmless in 9 of the 10 places it would be performed, being able to omit 9 out of 10 checks that would prevent it may improve efficiency without impairing correctness, but not if the only way to ensure the compiler generates the code for the necessary one is to include the checks in the nine unnecessary cases as well. – supercat Sep 21 '16 at 14:22
  • @supercat Let me give a (pathological) example of what I mean. Architecture with 32-bit registers and 48-bit pointers that consist of 16-bit segment selectors and 32-bit offsets; segments refer to separate memory spaces with different permissions managed by the OS. (i386 kind of supports this.) Adding and subtracting segments is a meaningless operation, which will at best yield an invalid or impermissible selector. Furthermore, even loading such an invalid selector into a segment register causes a hardware fault. Converting a pointer to an arbitrary pair of integers and back is expensive. – Davislor Sep 21 '16 at 18:14
  • @Lorehead: Guaranteeing that computations involving invalid pointers won't trap would be expensive *on platforms where they naturally would trap*, independent of "optimization". Such platforms are rare, however. On hardware that would never trap such operations, a compiler that guaranteed that such computations will be free of side-effects would have to forfeit some "optimizations" but the value of such guarantees to code that can exploit them is often far greater than the value of forfeited "optimizations". – supercat Sep 21 '16 at 19:29
  • @supercat I’m not sure what part of that you take me as disagreeing with. My point was that even an architecture where pointers trap could do a bunch of work to make code run as expected even when it assumes that pointers fit in general-purpose registers and you can do integer math on them. In my example, `size_t` would be a 32-bit word, but `ptrdiff_t` and `intptr_t` would store pointers in two words and do `long long int` math on them. But this would be less efficient, on that architecture, than testing the selectors for equality and doing ALU operations only on the offsets. – Davislor Sep 21 '16 at 21:28
  • 2
    @Lorehead: The modern usage of the term "optimization" refers to the notion that a compiler should aggressively identify situations that would invoke UB, and conclude that variables cannot hold values that would make such situations arise. For example, given the code `if (p != 0) doSomething(p); debug_log(*p);` a "modern" optimizing compiler could conclude that it was safe to make the call to `doSomething` unconditional since code would invoke UB if "p" is null, even if on the target platform reading a null pointer would simply yield a meaningless value. – supercat Sep 22 '16 at 00:16
11

Your code is undefined behavior for a different reason:

The expression begin - 1 does not yield an invalid pointer; it is undefined behavior. You are not allowed to perform pointer arithmetic beyond the bounds of the array you're working on. So it is the subtraction itself that is invalid, not the act of storing the resulting pointer.
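
If the intent is only to guard against stepping off the front of the array, the test can usually be rearranged so that no out-of-bounds pointer is ever formed. A sketch under that assumption (the function name check is made up):

#include <cstddef>

void check(const char* str, const char* begin) {
    // Same intent as (begin - 1 < str), but both operands stay within
    // [str, str + 1], so the comparison itself is well-defined:
    if (begin < str + 1) { /* ... do something ... */ }

    // Or keep an index and do the arithmetic on ordinary signed integers:
    std::ptrdiff_t i = begin - str;   // defined: both point into the same array
    if (i - 1 < 0) { /* ... do something ... */ }
}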

jalf
  • 243,077
  • 51
  • 345
  • 550
  • 1
    The C99 Rationale (linked to in my answer) specifically mentions pointer arithmetic beyond the bound of the array as yielding invalid pointers. – fizzer Oct 01 '10 at 12:24
  • If the expression was modified to `(ptrdiff_t)begin - 1`, would that still yield undefined behavior? Since ptrdiff_t has to be a signed integral type, I would think this would be okay. – Channel72 Oct 01 '10 at 12:26
  • 2
    A ptrdiff_t may only be calculated for two pointers into the same data object. The only exception to the "within the bounds of the array" is a pointer *one* beyond the *end* of the array. – DevSolar Oct 01 '10 at 13:13
  • @fizzer: I don't have the C++ standard here (formatted my computer a few days ago, and still need to grab that from my backups), but it states that this is undefined. I don't know if C does it differently, but I'd imagine that it's just that rationale deals with what *actually* happens (in reality, you just get an invalid pointer), but the standard is more strict and says "it's a nonsensical operation, it is undefined". – jalf Oct 01 '10 at 13:19
@Channel72: Yes, as long as the following are all true: (1) `sizeof(ptrdiff_t) >= sizeof(void*)` (this isn't necessarily guaranteed), (2) the result of casting `begin` to the signed integer type `ptrdiff_t` doesn't result in the minimum value representable by that type (if it does, then the subtraction will result in undefined behavior), and (3) the implementation defines conversion of a pointer to an integer consistently, so that you can compare the result of this expression with the result of `(ptrdiff_t)str` and get a meaningful result (also not guaranteed). – James McNellis Oct 01 '10 at 13:20
  • And (4), the result of the cast results in a value that is representable by a `ptrdiff_t` (the result of the cast might exceed the maximum value representable by a `ptrdiff_t`) [Those are for C, where there is an implicit conversion from pointer to integer; at least that's my understanding of it. I'd think the same is true for C++; the problem is that converting a pointer to an integer has implementation-defined results.] – James McNellis Oct 01 '10 at 13:22
@fizzer: The rationale gives a sound reason why implementations should be allowed to trap even on things like comparisons of formerly-valid pointers, but I wonder if anyone has proposed having the standard define a macro like __POINTER_EXTENSIONS which implementations would set to a value, conceptually somewhat like the __FP_EVAL_MODE, indicating what kinds of operations an implementation supports beyond those required by the standard. Many implementations can offer guarantees about how pointers behave which go far beyond what the standard requires, and algorithms which can use... – supercat May 29 '15 at 17:37
  • ...those guarantees can often be more efficient than algorithms which must work around the lack of them. It would be very difficult, for example, to write an efficient operating system without a means of testing whether a pointer identifies part of an object identified by a base pointer and size, but *most* hardware platforms would have no problem making `ptr1 >= base && ptr1 < base+size` indicate that. – supercat May 29 '15 at 17:41
  • @supercat why would an OS need that? The OS doesn't go around testing the objects you create in your program... – jalf May 30 '15 at 10:34
  • @jalf: Consider the design of `malloc` and `free` themselves. One could design a malloc/free system which stored enough information in each block's header that it wouldn't have to "search" for the block, but if there are many small objects such overhead could be severe. In many cases, it's more practical to have the header store enough information that the blocks' relationship to other blocks can be discovered without too much work, but some algorithms for that require comparisons among unrelated pointers. For that matter, consider `memmove`. One could design... – supercat May 30 '15 at 16:04
  • ...a `memmove` implementation which started from the front and checked for collisions before writing each byte, but it's much cleaner and easier to simply say that if `dest>souce`, copy top-to-bottom and otherwise copy bottom-to-top. If the pointers are unrelated, it won't matter which method of copying is chosen, provided only that the compiler doesn't use the "undefined behavior" as an excuse to do something annoying like omitting the copy operation altogether. – supercat May 30 '15 at 16:06
  • @supercat the semantics you're asking for are effectively what's provided by `std::less`. :) (and additionally, of course, the OS generally isn't bound by the rules of C++. It is free to provide additional implementation-dependent guarantees) – jalf May 30 '15 at 16:16
  • @jalf: I don't think the authors of Unix had access to std::less. Further, having a hardware platform that can provide functionality won't do any good if a compiler decides to excise it. Further, while the purpose of C was to facilitate Unix, there are many kinds of libraries that can benefit from various pointer-related guarantees. While it's possible in many cases to use command-line options to ask compilers to let programmers use them, there's no standard way by which code can check whether a compiler is being invoked in such fashion as to provide the guarantees it needs. – supercat May 30 '15 at 16:38
@jalf: If a compiler supports command-line options to ensure consistent behavior in situations not defined by the Standard, having the compiler also define macros to indicate the effects of those options should be trivial by comparison. Further, many older compilers without such options could be made compliant merely by having a compiler (or makefile) predefine macros describing the compiler's behavior. Personally, given `uint32_t n;` I see nothing to be gained by saying that while `n*=n;` is required to behave the same on all platforms where `int` or `long` is 32 bits as it is on... – supercat May 30 '15 at 16:57
You're moving the goalposts. First, you say "An OS *requires* this feature." Then you give an example of how `malloc` *could* be implemented using such a feature. And now you're giving an example of an OS that happened to not use this particular C++ feature because it was not written in C++. It's getting a bit hard to see where you're going with this. C++ should support a feature it already supports because some OSes weren't written in C++, and were able to depend on implementation-defined behavior that made the feature unnecessary at the language-standard level *anyway*? – jalf May 30 '15 at 17:51
  • What is "to be gained" from these rules for pointers in C++ is generality and efficiency. It allows for simpler, more efficient pointer comparisons on some architectures (say, ones with a segmented memory model, where being able to make these simplifying assumptions allows for a simpler implementation of pointer arithmetics and pointer comparisons. – jalf May 30 '15 at 17:55
  • @jalf: Architectures where relational operators on unrelated pointers would be expensive could comply with my proposed standard easily by simply setting the __POINTER_EXTENSIONS macro to indicate that it offers no standardized extensions to the features mandated by the standard. Existing code which was written for a compiler where relational operators simply "work" and exploits that fact, and which *probably* won't need to be targeted to an architecture where they wouldn't simply work, could be made "safe" by adding a check for the extensions it requires. – supercat May 30 '15 at 23:14
  • If code is ever going to run on architectures where `<` couldn't compare unrelated pointers as easily as related ones, then changing it to use `std::less` instead of `<` would be an improvement, but if--as would more likely be the case--it will never have to run on such architectures such a change would be a waste of time. Further, I like trying to write code which will work in either C or C++; I consider the divergence of the languages unfortunate since even code which doesn't use polymorphic objects could benefit from a number of C++ features. – supercat May 30 '15 at 23:19
  • I've not really done any "C++ as C++" programming; the only C++ programming I've done has been to code an emulation layer so that I can run the same code in an 8-bit embedded micro and on a PC (since the PC's debugging facilities are much nicer than those of the micro). C++ makes it possible to define a type that behaves like an embedded systems' 16-bit `unsigned int` where multiplying 0xFFFD by 0xFFFD yields 0x0009 [not UB]; while I think C should provide such types (they'd greatly facilitate migration of a lot of older-platform code to modern platforms) at present only C++ can do so. – supercat May 30 '15 at 23:24
8

Some architectures have dedicated registers for holding pointers. Putting the value of an unmapped address into such a register is allowed to crash. Integer overflow/underflow is allowed to crash. Because C aims to work on a broad variety of platforms, pointers provide a mechanism for safely programming unsafe circuits.

If you know you won't be running on exotic hardware with such finicky characteristics, you don't need to worry about what is undefined by the language. It is well-defined by the platform.

Of course, the example is poor style and there isn't a good reason to do it.
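
If code of this shape has to stay defined at the language level, one option (a sketch, not something this answer prescribes; the function name check is made up) is to do the arithmetic on integers instead of pointers. Converting an already-valid pointer to uintptr_t gives an implementation-defined value rather than undefined behavior, and no out-of-bounds pointer value is ever formed, so nothing questionable needs to be loaded into a pointer register:

#include <cstdint>

void check(const char* str, const char* begin) {
    // The subtraction and comparison below are ordinary unsigned integer
    // arithmetic; only already-valid pointers are converted.
    if (reinterpret_cast<std::uintptr_t>(begin) - 1 <
        reinterpret_cast<std::uintptr_t>(str)) { /* ... do something ... */ }
}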

Potatoswatter
  • 134,909
  • 25
  • 265
  • 421
  • 1
The fact that it's well-defined on a platform does not make it well-defined on all implementations targeting that platform. Compilers whose writers are more interested in "optimization" than in supporting low-level programming cannot be relied upon to behave reliably with such code even if the underlying platform would. – supercat Jul 21 '17 at 15:26
@supercat That's a good point, and you're technically correct. In practice, though, when a compiler gets so aggressive that `ptr = arr - 1;` becomes a no-op (or crash, or…), its users just might get so upset that they go find other compilers. While the standard allows it, such behavior is so subtly pathological and such computations are so common that it's seldom a viable solution. – Potatoswatter Jul 25 '17 at 19:44
  • Compilers like gcc and clang seem popular, even though their behaviors would have been considered outrageous in saner times. One of the reasons the authors of the Standard made short unsigned types promote as signed was, according to the rationale, that the majority of then-current implementations would process something like `unsigned mul_mod_65535(unsigned short x, unsigned short y) { return (x*y) & 0xFFFF; }` in the logical fashion even if `x*y` was larger than `INT_MAX`. GCC, however, will sometimes "optimize" that function in ways that break. – supercat Jul 25 '17 at 19:54
  • @supercat Yes, that's another perennial source of complaints. Still, it's easier to catch that sort of bug. Out-of-bounds computations are sometimes hard to avoid and difficult to see in the code. C++ is introducing `std::launder` to selectively bless such values, but actually specifying that function has been about as weird as you might expect. – Potatoswatter Jul 25 '17 at 20:00
  • A major problem with that sort of thing in C has been that compilers which were suitable for low-level programming would provide the necessary semantics without directives, and programs that could run on heavily-optimizing compilers didn't need such semantics. Had the Standard defined such semantics, compilers suitable for low-level programming could have simply ignored them when they weren't required, but provided optimization modes for use with programs that marked all the places they did anything "tricky". – supercat Jul 25 '17 at 20:26
  • A further problem with C's aliasing rules is that they are based upon the dynamic contents of memory, rather than upon static aspects of program structure. If the rules had specified that when a pointer is cast from `T*` to `U*`, such a cast creates a "window" during which the pointer may be used to access things of type `T*` or `U*`, such rules could allow more optimizations than are allowable under current rules while also allowing the use of a lot of code that would otherwise require `-fno-strict-aliasing`. – supercat Jul 25 '17 at 20:31
  • Could you name one architecture of the type you mention in your answer please? – Evg Sep 23 '20 at 12:44
  • 1
    @Evg m68k has address registers and I’m not 100% sure but the unmapped address load comment was probably referring to IA64. – Potatoswatter Sep 23 '20 at 13:37
4

Any use of an invalid pointer yields undefined behaviour. I don't have the C Standard here at work, but see 'invalid pointers' in the Rationale: http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf

fizzer
  • 13,551
  • 9
  • 39
  • 61
  • If that's the case, couldn't you just cast all your pointers to a `ptrdiff_t` when doing pointer arithmetic? In other words, if I changed my above code sample to read `if ((ptrdiff_t)begin - 1)` would that no longer be undefined behavior? – Channel72 Oct 01 '10 at 12:09
Not undefined behaviour, but the result is implementation-defined. That is, your implementation will document some reasonable behaviour, but it will not be portable, and may not be useful. – fizzer Oct 01 '10 at 12:30
  • The comp.lang.c FAQ addresses this: http://c-faq.com/ptrs/int2ptr.html. Like I said, I don't have the Standard to hand. – fizzer Oct 01 '10 at 12:32
  • 3
    Note that ptrdiff_t will hold the *difference between* pointers, not pointers themselves. This is not the same thing. – fizzer Oct 01 '10 at 12:36
2

§5.7/6 - "Unless both pointers point to elements of the same array object, or one past the last element of the array object, the behavior is undefined." [footnote 75]

In summary, it is undefined behavior even if you do not dereference the pointer.
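
Note that the quoted sentence covers subtracting one pointer from another (as the comment below points out). A small sketch of what it does and does not allow (the array names are made up):

#include <cstddef>

int a[4];
int b[4];
std::ptrdiff_t d1 = &a[3] - &a[0];   // OK: both point into the same array
std::ptrdiff_t d2 = (a + 4) - a;     // OK: one past the last element is allowed
std::ptrdiff_t d3 = &b[0] - &a[0];   // undefined: pointers into different arrays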

Chubsdad
  • 24,777
  • 4
  • 73
  • 129
  • 1
    That text concerns subtraction of a pointer from a pointer; the OP is subtracting an integer from a pointer. – James McNellis Oct 01 '10 at 13:13
  • @James McNellis: That's about pointer arithmetic I guess. Ultimately it's about the resultant pointer value – Chubsdad Oct 01 '10 at 13:15
I am unsure about your reasoning: when subtracting two pointers from different arrays you might in fact have issues, because the pointers point to different memory zones (think far/near memory on 16-bit architectures). There is nothing here about meddling with the pointer values themselves; in fact it is quite common to use the upper bits of 64-bit pointers to store additional flags. – Matthieu M. Oct 01 '10 at 14:33
1

The correct answers were given years ago, but I find it interesting that the C99 rationale [sec. 6.5.6, last 3 paragraphs] explains why the standard endorses adding 1 to a pointer that points to the last element of an array (p+1):

An important endorsement of widespread practice is the requirement that a pointer can always be incremented to just past the end of an array, with no fear of overflow or wraparound

and why p-1 is not endorsed:

In the case of p-1, on the other hand, an entire object would have to be allocated prior to the array of objects that p traverses, so decrement loops that run off the bottom of an array can fail. This restriction allows segmented architectures, for instance, to place objects at the start of a range of addressable memory.

So if the pointer p points to an object at the start of a range of addressable memory, which is endorsed by this comment, then p-1 would generate an underflow.

Note that integer overflow is the standard's example for undefined behavior [sec. 3.4.3], as it depends on the translation environment and the operating environment. I believe it is easy to see that this dependence on the environment extends to pointer underflow.

This is why the standard explicitly makes it undefined behavior [in 6.5.6/8], as noted by other answers here. To cite that sentence:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

See also [sec. 6.3.2.3, last 4 paragraphs] of the C99 rationale, which gives a more detailed description of how invalid pointers can be generated, and what effects that may have.
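
A common way to walk an array backwards without ever computing a pointer before its first element (a sketch, not taken from the rationale) is to start one past the end and decrement before each use:

#include <cstdio>

int main() {
    int a[] = {1, 2, 3, 4};

    // p never takes a value below &a[0]: it is decremented only while
    // it is still strictly greater than a.
    for (int* p = a + 4; p != a; ) {
        --p;
        std::printf("%d\n", *p);
    }
}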

Orafu
  • 311
  • 2
  • 5
0

Yes, it's undefined behavior. See the accepted answer to this closely related question. Assigning an invalid pointer to a variable, comparing an invalid pointer, or casting an invalid pointer all trigger undefined behavior.
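
A compact sketch of the operations this answer lists (the names are made up); note that, per the other answers, even forming the pointer on the second line is already where the undefined behavior starts:

const char* str = "abcdef";
const char* bad = str - 1;                            // forming the invalid pointer
const char* copy = bad;                               // assigning it
bool below = bad < str;                               // comparing it
const void* erased = static_cast<const void*>(bad);   // casting it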

sharptooth
  • 167,383
  • 100
  • 513
  • 979