
This question may be a bit controversial. I have the following code at block scope:

int *a = malloc(3 * sizeof(int));
if (!a) { ... error handling ... }
a[0] = 0;
a[1] = 1;
a[2] = 2;

I argue that this code invokes UB due to out-of-bounds pointer arithmetic. The reason is that the effective type of the object pointed to by a is never set to int[3], only to int. Therefore any access to the object at an index other than 0 is not defined by the C standard.

Here is why:

Line a = malloc(...): if the allocation succeeds, then a points to a region large enough to store 3 ints.

a[0] = ... is equivalent to *a = ..., where *a is an lvalue of type int. It sets the effective type of the first sizeof(int) bytes to int, as stated in 6.5p6:

... For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.

Now the pointer a points to an object of type int, not int[3].

a[1] = ... is equivalent to *(a + 1) = .... The expression a + 1 points one past the end of the int object accessible through *a. This pointer is valid for comparison, but dereferencing it is undefined due to:

Rule 6.5.6p7:

... a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

And rule 6.5.6p8:

... If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

A similar issue applies to a[2] = ..., but here even the a + 2 hidden in a[2] invokes UB.
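To make the walk-through concrete, here is the same pattern as a small compilable sketch, with each step annotated with the paragraph cited above. The names sum_of_three and demo are hypothetical and only for illustration. The comments restate the reading argued in this question; they are not a settled interpretation of the standard.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the original snippet): runs the pattern
 * under discussion and returns the sum of the three elements, or -1 if
 * malloc fails. */
int sum_of_three(void)
{
    int *a = malloc(3 * sizeof(int)); /* region has no declared or effective type yet */
    if (!a)
        return -1;

    a[0] = 0; /* *(a + 0): per 6.5p6 the first sizeof(int) bytes get effective type int */
    a[1] = 1; /* *(a + 1): under this reading, a + 1 is one past a length-one array (6.5.6p7, p8) */
    a[2] = 2; /* *(a + 2): under this reading, even forming a + 2 is out of bounds */

    int sum = a[0] + a[1] + a[2];
    free(a);
    return sum;
}

/* Usage sketch: the result is either the sum or the allocation-failure code. */
void demo(void)
{
    int r = sum_of_three();
    assert(r == 3 || r == -1);
}
```

Implementations in common use are expected to return 3 here; the question is only whether the text of the standard guarantees it.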

The issue could be resolved if the standard allowed arbitrary pointer arithmetic within a valid region of memory as long as the alignment requirements and the strict aliasing rule are satisfied, or if any collection of consecutive objects of the same type could be treated as an array. However, I was not able to find such a rule.
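For comparison, 6.5p6 also says that when a value is copied into an object with no declared type using memcpy or memmove, the effective type of the modified object becomes the effective type of the object the value was copied from, if it has one. A minimal sketch of that route follows, assuming (and this is only an assumption) that copying from a declared int[3] gives the allocated region the effective type int[3]. The name make_three is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocates room for three ints and fills it by
 * copying from an object whose declared type is int[3]. Returns NULL
 * if malloc fails; the caller frees the result. */
int *make_three(void)
{
    int src[3] = {0, 1, 2};      /* declared type int[3] */
    int *a = malloc(sizeof src); /* no declared or effective type yet */
    if (!a)
        return NULL;

    /* Assumption: per the memcpy clause of 6.5p6, the copied-to region
     * takes its effective type from src. */
    memcpy(a, src, sizeof src);
    return a;
}

/* Usage sketch. */
void demo_make_three(void)
{
    int *a = make_three();
    if (a) {
        assert(a[0] == 0 && a[1] == 1 && a[2] == 2);
        free(a);
    }
}
```

Even if this reading holds, it does not help the common case where elements are stored one at a time through an int lvalue, which is the case this question is about.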

If my interpretation of the standard is correct then some C code (all of it?) would be undefined. Therefore it is one of those rare cases when I hope that I am wrong.

Am I?

tstanisl
  • You're correct that `a` doesn't point to an object of type `int[3]`. One reason is that a pointer to `int[3]` would have the type `int (*)[3]` which is very different from the type of `a`. Instead it says that `a + i` (for any valid index `i`, including `0`) is pointing to an `int`. – Some programmer dude Dec 01 '21 at 13:31
  • @Someprogrammerdude, ok I meant that it "does not point to the first element of array of type `int[3]`" – tstanisl Dec 01 '21 at 13:33
  • _7.22.3 Memory management functions_ ".... and then used to access such an object or **an array of such objects** in the space allocated ..." is probably relevant. That usage of malloc is all over the place in C, you're overthinking this. – Mat Dec 01 '21 at 13:33
  • The effective type and strict aliasing rules are plain broken and this is one such example. However, the rule about pointer arithmetic only being allowed within an array is equally broken, whenever applied to a chunk of data of unknown (effective) type. You get the same problems whenever doing pointer arithmetic on for example a map of hardware registers in a microcontroller. The C standard doesn't generally acknowledge that there can be things placed in the address space which were not placed there by a C compiler. – Lundin Dec 01 '21 at 13:34
  • @Mat, yes, I'm overthinking, but the *language-lawyer* tag is exactly for overthinking things. The wording from `7.22.3` looks relevant but it contradicts other, more explicit rules. – tstanisl Dec 01 '21 at 13:39
  • @Mat Rather, whoever came up with the rules of effective type were "underthinking" this. They don't address arrays/aggregate types nor do they address type qualifiers. The whole of 6.5 §6-§7 can be replaced with "here the implementation can puzzle things together between the lines as it pleases, in an undocumented manner". All of this boils down to quality of implementation in the end. – Lundin Dec 01 '21 at 13:40
  • @dbush Nah, so far no strict-aliasing tag. Also I've already done sufficient ranting about how bad these rules are myself :) – Lundin Dec 01 '21 at 13:48
  • @Lundin, could you share the link to the rant? – tstanisl Dec 01 '21 at 13:49
  • _Therefore any access to the object at an index other than 0 is not defined by C standard … `a[0] = ...` is equivalent to `*a = ...`_ If you wanna be consistent, `a[0]` is equivalent to `*(a + 0)` and `a + 0`, in turn, requires `a` to point to an array element to have defined behavior, there is no exception for adding 0. – Language Lawyer Dec 01 '21 at 13:50
  • @LanguageLawyer, nope, `(int*)NULL + 0` is perfectly valid – tstanisl Dec 01 '21 at 13:51
  • _`(int*)NULL + 0` is perfectly valid_ Not in C – Language Lawyer Dec 01 '21 at 13:52
  • @LanguageLawyer, mhm.. it looks like C++ allows it (https://stackoverflow.com/a/59409094/4989451). It would be surprising if C did not but you may be right. So now even `a[0]` is *UB*? – tstanisl Dec 01 '21 at 13:55
  • @Mat: It is not possible to “overthink” a question with the language-lawyer tag. Among the goals for language-lawyer discussions would be moving toward a formal mathematical specification of the language, so getting **every** detail **exactly** correct is relevant. – Eric Postpischil Dec 01 '21 at 14:00
  • _it look that C++ allows it_ I know. _It would be surprising if C did not but you might be right_ Why speculate when you can check (http://port70.net/~nsz/c/c11/n1570.html#6.5.6)? _So now even `a[0]` is UB?_ If you wanna be consistent, you should consider it so. – Language Lawyer Dec 01 '21 at 14:00
  • @LanguageLawyer In this case 6.5.6/7 applies: "For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type." Meaning `a[0]` is well-defined but `a[1]` is strictly speaking not covered by the C standard. – Lundin Dec 01 '21 at 14:18
  • @Lundin _In this case_ Which one? `(int*)NULL + 0`? BTW, /8 doesn't say that «array object» shall be an object having effective array type, it can just have array type. (I don't wanna say that _The effective type and strict aliasing rules are plain broken_ is wrong. I'm not sure that _this is one such example_) – Language Lawyer Dec 01 '21 at 14:23
  • @Lundin Ah, ok, I see what you wanna say. You, for some reason, assume that `a` points to an object of `int` type (which is not an element of an array). Then /7 will, ofc, apply. The thing is, I don't assume that `a` point to an object of `int` type. Why would it? Just because we casted a «valid» pointer value to `int*`? By the same logic we can say it points to the first element of array of 3 `int`s just because we are adding `2` to it. – Language Lawyer Dec 01 '21 at 14:33
  • @LanguageLawyer Because the chunk returned by malloc has no effective/declared type. "If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value." Nowhere does this rule consider arrays, so unless the access is done through some `*( int(*)[n] )` type, what is there to turn the chunk with no declared/effective type into effective type `int[n]`. – Lundin Dec 01 '21 at 14:43
  • @Lundin I got what you mean. If you wanna say that there shall be effective array type for /8 to apply, then there shall be effective `int` type for /7 to apply, shan't it? Or it would be nice to hear why effective type matters for /8 but not for /7 – Language Lawyer Dec 01 '21 at 14:50
  • @LanguageLawyer So what you are saying is that we can't do `a[0] = 0;` because at that point (before this expression has been executed), the item pointed at by `a` has no declared/effective type? – Lundin Dec 01 '21 at 14:54
  • @Lundin Well, sorta. Not that I really wanna say something concrete about definedness of `a[0]`, just asking about what, to me, looks like inconsistence and dual standards in reading of 6.5.6 p7 and p8 – Language Lawyer Dec 01 '21 at 15:05

1 Answer


The Standard only "halfway" defines the term "object": it says that every object is a region of storage, but it does not specify when a region of storage is or is not an object. For most of the Standard, it would be fine to say that every region of storage simultaneously contains all objects of all types that will fit therein; any action which modifies an object modifies the underlying storage, and any action which modifies the underlying storage modifies the stored value of all objects therein.
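A minimal sketch of the "storage and object are the same thing" view, using character-type access, which the Standard permits for any object: writing to the underlying bytes changes the stored value of the int that occupies them. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: modifies the storage of an int through
 * unsigned char lvalues and returns the int's resulting value. */
int clear_through_bytes(void)
{
    int x = 12345;
    unsigned char *bytes = (unsigned char *)&x; /* character access to x's storage */

    for (size_t i = 0; i < sizeof x; i++)
        bytes[i] = 0; /* modifies the storage, and with it the stored value of x */

    return x; /* all bits zero represents the value 0 for integer types */
}

/* Usage sketch. */
void demo_clear(void)
{
    assert(clear_through_bytes() == 0);
}
```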

I think it's fairly clear that the authors of the Standard expected that in cases where the Standard says an action invokes Undefined Behavior, but the behavior would be defined in the absence of that statement, quality implementations should behave in the defined fashion in cases where their customers would find that useful. The question of which cases those are, however, is a Quality of Implementation issue outside the Standard's jurisdiction. As such, it didn't really matter if the Standard characterized as Undefined Behavior some action which all implementations to date had processed in the same obviously-useful fashion, because nobody seeking to sell compilers would interpret the Standard's failure to mandate such a behavior as an invitation to deviate from it in ways that would be detrimental to their customers.

Because different compilers are used for different purposes, the only way the Standard could actually define all the behaviors which would be needed for many low-level programming tasks while also allowing all of the optimizations that would be useful for high-end number crunching would be to either recognize categories of implementations that make different optimizations, or add better means of inviting or blocking optimizations that would usefully improve performance and/or result in incorrect program behavior. Because every compiler that has ever existed or will plausibly ever exist will refrain from making some optimizations that would otherwise have been useful, and/or perform "optimizations" which incorrectly process some Strictly Conforming C11 programs, the question of whether the Standard would allow a silly optimization should only be relevant to people who either want to write poor quality compilers, or who want to bend over backward to be compatible with them.

supercat
  • *because nobody seeking to sell compilers would interpret the Standard's failure to mandate such a behavior as an invitation to deviate from it*... optimizing compilers are not far from that when they take advantage of potential undefined behavior to generate counter intuitive optimisations and break existing code that was not fully defined but ran fine with previous state of the art. – chqrlie Dec 04 '21 at 23:07
  • @chqrlie: Perhaps I should have re-amplified the part of the text I'd italicized above: ...to deviate from it *in any way that would make the compiler less useful for their customers*. For most purposes a compiler that can meaningfully process a wide range of non-portable programs would be more useful than one that could not. Given `float *floatPtr`, there is no reason why a quality compiler should, absent some unusual configuration options, assume that an access to `*(unsigned*)floatPtr` wouldn't access an object of type `float`. Actually, if one recognizes the principle that... – supercat Dec 05 '21 at 17:57
  • ...an access made via lvalue whose address is freshly visibly derived from one of a particular type should be recognized as *being* an access of that type in cases where the latter would be defined, but left the meaning of "freshly visibly derived" as a Quality of Implementation issue, that would be much more workable for programmers and most compiler writers alike, at least for people who aren't having to maintain compilers whose front-ends strip out information necessary to support such constructs. – supercat Dec 05 '21 at 18:01
  • So is the answer to question that it's the example of a "technical UB"? A kind of UB that all relevant/useful implementations of C define in the same way. It looks like some kind of defect in the standard. – tstanisl Dec 08 '21 at 22:22
  • @tstanisl: The Standard was never intended to describe all of the situations in which implementations claiming suitability for any particular purpose should be expected to behave usefully. The fact that it doesn't do so isn't really a defect. The primary failing is its failure to make clear that it waives jurisdiction over many *correct but non-portable programs*, and that waiver of jurisdiction over a program's behavior does not imply any judgment that the program should be viewed as "erroneous" or "broken". – supercat Dec 08 '21 at 22:59
  • @tstanisl: As a consequence of the latter failing, a philosophy has emerged, and been embraced by the maintainers of clang and gcc, that no particular response to non-portable programs or erroneous data should be regarded as inferior to any other, and the only thing that matters is how well an implementation can process portable programs given correct data. – supercat Dec 08 '21 at 23:01