
Let's say I have the following array:

unsigned char array[5];

So I'm trying to understand what exactly is undefined in accessing

array[6]

As I understand it, the compiler will try to read the value inside *(array+6), so it is defined. So, if I understand right, what is undefined is what happens when you try to read uninitialized memory.

If this is the case, then the top answer here says that reading an uninitialized value is defined unless the type has a trap representation, and we know that unsigned char does not have one.

So is it well defined here? Will it always read *(array+6)?

Yunnosch
Moshe Levy
  • It's UB because you access the array out of bounds, as simple as that. `array+6` would also be UB, I believe, since the formed pointer would be invalid. – HolyBlackCat Feb 06 '22 at 14:03
  • The memory doesn't even have to exist... – Andrew Henle Feb 06 '22 at 14:07
  • @HolyBlackCat So is generating such code UB, or is it UB only when you access the memory? Will we always get code that tries to read the address? If it is UB only when you access the memory, then it's the same as accessing uninitialized memory, and that's not UB in all cases. – Moshe Levy Feb 06 '22 at 14:12
  • Executing this code causes UB. More specifically, when the out-of-bounds access becomes inevitable, the behavior is undefined (i.e. UB can travel back in time). – HolyBlackCat Feb 06 '22 at 14:49
  • What if `array` lies at `0x341265431FFFFFFFB`? `array[1]` at `0x341265431FFFFFFFC`... and `array[4]` at `0x341265431FFFFFFFF`... that would make (the hypothetical) `array[5]` be at `0x34126543200000000`, which may be invalid :) – pmg Feb 06 '22 at 14:51
  • Related: [1](https://stackoverflow.com/questions/9137157), [2](https://stackoverflow.com/questions/6452959), [3](https://stackoverflow.com/questions/1239938), [4](https://stackoverflow.com/questions/11551472), [5](https://stackoverflow.com/questions/15646973), [6](https://stackoverflow.com/questions/55692816), [7](https://stackoverflow.com/questions/57247807), [8](https://stackoverflow.com/questions/57930992). – Steve Summit Feb 06 '22 at 15:05
  • @SteveSummit I don't find those closely related. I'm trying to understand what exactly is undefined here: is it the memory access that may cause undefined behavior, or the code itself? – Moshe Levy Feb 06 '22 at 15:09
  • OP is assuming the memory just after the array is both uninitialised and accessible. It might be uninitialised, but it might be initialised too. It might trigger an access violation, it might not. – Cheatah Feb 06 '22 at 15:10
  • @MosheLevy Sorry, I meant to say "Possibly related". – Steve Summit Feb 06 '22 at 15:24

3 Answers


I'm trying to understand what exactly is undefined in accessing

array[6]

Per C 2018 3.4.3, “undefined behavior” means the C standard imposes no requirements. So, undefined behavior is not a specific thing; it means there is a complete lack of the C standard saying what should or should not happen.

array[6] is undefined because arithmetic outside of array bounds is not defined. array[6] is specified to be equivalent to *(array+6), and addition to a pointer is specified in C 2018 6.5.6 8. It says that additions that result in pointing to an element in the array or one beyond the last element are defined, but “otherwise, the behavior is undefined.” So array[6] has undefined behavior because the C standard explicitly says so.

If we consider that other case, pointing one beyond the last array element, then array+5 is defined. However *(array+5) is not defined, because C 2018 6.5.6 8 tells us that, while array+5 points to the location just beyond the last element of the array, the * operator shall not be applied to it:

If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

In this context (a paragraph of the standard that is not one of the “Constraints” paragraphs), violating a “shall not” rule means the behavior is undefined, because C 2018 4 2 says:

If a "shall" or "shall not" requirement that appears outside of a constraint or runtime-constraint is violated, the behavior is undefined…

So, exactly what causes this to be undefined is that it violates an explicit rule of the standard, and violating that rule means the behavior is undefined.
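The distinction between forming a one-past-the-end pointer and dereferencing it can be sketched as follows. This is a minimal illustration, not something from the standard itself: the function names `sum_bytes` and `sum_example` are made up for the example. The end pointer `array + 5` is only ever compared against, never dereferenced, so every operation stays within what 6.5.6 8 defines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums the bytes in the half-open range
   [first, one_past_last). The end pointer is only compared against;
   the unary * operator is never applied to it. */
unsigned sum_bytes(const unsigned char *first, const unsigned char *one_past_last)
{
    assert(first <= one_past_last);   /* both must point into the same array */
    unsigned total = 0;
    for (const unsigned char *p = first; p != one_past_last; ++p)
        total += *p;                  /* p always points to a real element here */
    return total;
}

/* Hypothetical example built on the 5-element array from the question. */
unsigned sum_example(void)
{
    unsigned char array[5] = {1, 2, 3, 4, 5};
    const unsigned char *end = array + 5;  /* defined: one past the last element */
    /* Forming array + 6 would already be undefined, even without a dereference,
       so it is deliberately not written here. */
    return sum_bytes(array, end);          /* reads array[0] through array[4] only */
}
```

Calling `sum_example()` adds up the five elements and never touches memory outside the array.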

for my understanding, the compiler will try to read the value inside * (array+6) so it is defined,…

No, *(array+i) refers to element i of the array array when the behavior is defined. When the behavior is not defined, it does not have any meaning.

… so if I understand right, what is undefined is what happens when you try to read uninitialized memory.

The rules around this are complicated, and they are not actually relevant here. Since *(array+6) is undefined, it does not mean to read memory. It has no meaning as far as the C standard is concerned.

When the compiler is compiling code and encounters *(array+6) used as a value, it will ordinarily generate instructions to get the value of element 6 of the array. By “ordinarily,” I mean when the behavior is defined. However, modern compilers are not simplistic code generators. They perform semantic analysis and optimization of the program. Instead of reading code and generating instructions, they read code and generate a semantic representation of the program. That semantic representation and the rules of the C standard are operated on by optimizing software. Once the optimizing software is done, then instructions are generated. In this process, the semantic analyzer may determine that *(array+6) has no meaning, and then the optimizer will strip it away. It will not generate code to load from the memory that array+6 would reference if things were different.

This is useful when there is code such as:

if (some test)
{
    x = array[6];
    various code;
}
else
{
    x = array[3];
    various code;
}

When the compiler can deduce in a certain situation that array[6] is not defined, then it can conclude that the only defined path through this code is the one where some test is false. It can then generate the code for the “else” case and remove the code for the “then” case.

Logically, this is equivalent to “Since the behavior of the code in the ‘then’ case is not defined, it can be anything. If we choose to make it the same as the code in the ‘else’ case, then we can use the same instructions for both cases, so we do not need to branch based on some test and will have smaller code. This conforms to the C standard, because it does not define what the ‘then’ case does.”
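As a rough sketch of that reasoning, the two functions below are hypothetical (the names `as_written` and `as_may_be_compiled` are invented for illustration). They assume the caller always passes a pointer to a 5-element array. If the compiler can see that fact, for example after inlining, the standard permits it to treat the first function as if it were the second. Whether a particular compiler actually does so depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the source as the programmer wrote it.
   Assumption: a always points to the first element of a 5-element array. */
unsigned char as_written(const unsigned char *a, int some_test)
{
    assert(a != NULL);
    if (some_test)
        return a[6];   /* no defined meaning for a 5-element array */
    else
        return a[3];   /* the only path with defined behavior */
}

/* Hypothetical: what the compiler may emit instead. Since the "then" path
   has no defined behavior, the test and the branch can be dropped. */
unsigned char as_may_be_compiled(const unsigned char *a, int some_test)
{
    assert(a != NULL);
    (void)some_test;   /* the test no longer influences the result */
    return a[3];
}
```

For every call whose behavior is defined (that is, `some_test` is zero), the two functions return the same value, which is all the standard requires.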

Eric Postpischil
  • thanks for the great explanation, a related question, what happens if I try to read some random memory? int* x = 1000 and then *x? – Moshe Levy Feb 06 '22 at 14:39
  • @MosheLevy: First, the compiler must diagnose a problem in `int *x = 1000;` because the constraints for `=` do not allow assigning an integer to a pointer. You can fix that with `int *x = (int *) 1000;`. Then the conversion is implementation defined. At that point, if `(int *) 1000` is not the address of an `int` object (or one of certain closely related types), then `*x` has undefined behavior as far as the C standard is concerned, and the standard would allow the compiler to behave as described in this answer. – Eric Postpischil Feb 06 '22 at 14:46
  • @MosheLevy: However, if you assign an address from an integer in this way, the compiler may treat it as if you know what you are doing and it is a valid address. This goes beyond what the C standard requires; it may be designed into some compilers because it is a pattern used to access certain things at fixed places in memory (like memory-mapped device control registers in certain hardware) or the “common page” that macOS maps at a fixed address for all processes. – Eric Postpischil Feb 06 '22 at 14:47
  • So if we take unsigned char instead, will this code always be valid, and will we always access the address, so that the behavior is defined? (We will probably get a segfault, but that's not my question.) – Moshe Levy Feb 06 '22 at 14:55
  • @MosheLevy: I am not sure where you are getting this `unsigned char` thing from. Perhaps from the rules about accessing uninitialized object, but those are different from the rules about accessing memory that is not part of a defined object. If you access an array outside of its bounds, the C standard does not define the behavior, regardless of the type. If you use the value of an object, whether an array element or something else, so the access to memory is defined, but the object is not initialized, then the behavior has some complicated rules. – Eric Postpischil Feb 06 '22 at 15:18
  • I'm talking about the example in the comment: if I have `unsigned char* x = (unsigned char*)1000`, and then `*x`. Here I assume that x is indeed the address of an unsigned char, so will the code read the value at address 1000, or may it be undefined? – Moshe Levy Feb 06 '22 at 15:20
  • @MosheLevy: If the compiler supports using 1000 as an address, then using `*x` as a value will read the memory at address 1000. – Eric Postpischil Feb 06 '22 at 15:38
  • @user17732522: Yes, off by one. I will update when I have some more time. – Eric Postpischil Feb 06 '22 at 15:58

It seems that you are confusing "undefined behaviour" with "unclear address" or "accessing non-initialised parts of an array".
The problem here is not that the memory which you would access is non-initialised.
The problem here is not that it is unclear what array[6] should mean.
The problem is that the assumption of non-initialised memory existing at that address is not allowed, and the assumption that it is possible to access that memory is not allowed either.
Compilers are given as much leeway for optimisations as possible. Most models of memory are only that, models.
The C standard requires in many cases that an executable created by a compiler etc. should "act as if ...", without ever requiring it to actually "do ...".
The result is that there are many examples of Undefined Behaviour (capitals, because this is a more specific term than "the way a program acts in certain situations which nobody bothered to explain clearly"). They are very important for minimising restrictions which would impede optimisations.
So in short (and somewhat blunt), if something is UB, then stop thinking about it. Many answers here on StackOverflow reach a point of explanation where they state "UB, end of explanation". Some then go on, but only after making explicit that they are speculating on what might have happened in the specific observed case.

Yunnosch

Quite simply, what you're doing is undefined behavior because the C standard says it is.

From section 6.5.6p8 regarding additive operators:

... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. ...

So given that array has 5 elements, array + 6 would point more than one element past the end of the array, so creating such a pointer is undefined even if it is not dereferenced.

As to what can happen in a particular implementation, some possibilities are:

  • You get a pointer to something two bytes past the end of the array. This could be some other object, padding between objects, or part of the return address for the current function.
  • The result points to memory that hasn't yet been mapped into the process's memory space, and attempting to dereference the pointer can result in a segmentation fault.
  • The compiler can perform optimizations that assume that undefined behavior does not exist and something else may happen.

These are just a few examples of how undefined behavior can manifest.

The bottom line: the C standard says not to do it, so don't do it.
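One way to follow that advice is to check the index before using it. The helper below is only a sketch under assumptions: the name `read_checked` and its signature are invented here, and the caller is trusted to pass the true element count in `len`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores a[index] in *out and returns true when index
   is within bounds; returns false without touching memory otherwise.
   Assumption: len is the real number of elements that a points to. */
bool read_checked(const unsigned char *a, size_t len, size_t index, unsigned char *out)
{
    assert(a != NULL && out != NULL);
    if (index >= len)
        return false;   /* neither a + index nor a[index] is ever evaluated */
    *out = a[index];
    return true;
}
```

With a 5-element array, `read_checked(array, 5, 3, &value)` succeeds, while an index of 5 or 6 is rejected before any pointer arithmetic takes place.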

dbush