38

Consider the following code:

int a[25][80];
a[0][1234] = 56;
int* p = &a[0][0];
p[1234] = 56;

Does the second line invoke undefined behavior? How about the fourth line?

fredoverflow
  • 256,549
  • 94
  • 388
  • 662
  • as array bounds are not checked so it should no give any error! – teacher Sep 01 '11 at 10:40
  • int a[25][80] is allocating a memory of 80*25*4 bytes, and its a contiguous memory allocation so if you are accessing a[0][1234] actually you are accessing memory at 1234 from base address! line 4 is not giving you error is because a[0][1234] = *((*a)+1234) =p[1234]; – teacher Sep 01 '11 at 10:45
  • Thanks for getting this sorted out! I hope someone follows this up with the similar question about `std::array`. I should point out that this question was motivated by a follow-up of the OP of [this question](http://stackoverflow.com/questions/7265157/is-this-nested-array-using-stack-or-heap-memory) -- I've encouraged him to ask about `std::array` separately, though. – Kerrek SB Sep 01 '11 at 10:55
  • @teacher — Actually, `int a[25][80]` is of size `80*25*sizeof(int)` bytes, which just might happen on most systems to be `80*25*4` bytes. – Todd Lehman Jun 28 '14 at 07:06
  • @ToddLehman: it's `25*sizeof(int[80])` and `sizeof(int[80])` might not be 80*4 bytes, in case of padding. It's unlikely for 80 and maybe unlikely for int. But a byte[3] might be aligned to 4 bytes. – Thomas Weller Oct 14 '22 at 12:09
  • @ThomasWeller Padding only applies to structs and unions, not arrays. The standard about `sizeof`: "When applied to an operand that has array type, the result is the total number of bytes in the array. When applied to an operand that has structure or union type, the result is the total number of bytes in such an object, including internal and trailing padding." – fredoverflow Oct 16 '22 at 09:19
  • @fredoverflow: yes, 4 would be the total number of bytes in the byte[3] array. This does not contradict. – Thomas Weller Oct 16 '22 at 14:27

6 Answers6

9

Both lines do result in undefined behavior.

Subscripting is interpreted as pointer addition followed by an indirection, that is, a[0][1234]/p[1234] is equivalent to *(a[0] + 1234)/*(p + 1234). According to [expr.add]/4 (here I quote the newest draft, while for the time OP is proposed, you can refer to this comment, and the conclusion is the same):

If the expression P points to element x[i] of an array object x with n elements, the expressions P + J and J + P (where J has the value j) point to the (possibly-hypothetical) element x[i+j] if 0≤i+j≤n; otherwise, the behavior is undefined.

since a[0](decayed to a pointer to a[0][0])/p points to an element of a[0] (as an array), and a[0] only has size 80, the behavior is undefined.


As Language Lawyer pointed out in the comment, the following program does not compile.

constexpr int f(const int (&a)[2][3])
{
    auto p = &a[0][0];
    return p[3];
}

int main()
{
    constexpr int a[2][3] = { 1, 2, 3, 4, 5, 6, };
    constexpr int i = f(a);
}

The compiler detected such undefined behaviors when it appears in a constant expression.

xskxzr
  • 12,442
  • 12
  • 37
  • 77
  • Does "launder" factor into this? Certainly a function which is supposed to operate on all the bytes of an object (e.g. `fwrite` or variations thereof) should be able to write out everything in a PODS, even if the PODS happens to start with a two-dimensional array. Deprecating code that does such things without laundering pointers might make sense (though quality compilers should offer an option to keep processing code according to old rules in exchange for some efficiency loss), but there must be *some* way to get a pointer that can access an entire item. – supercat Jun 11 '18 at 16:22
  • @supercat Maybe you are looking for [this question](https://stackoverflow.com/questions/47498585/is-adding-to-a-char-pointer-ub-when-it-doesnt-actually-point-to-a-char-arr)? – xskxzr Jun 11 '18 at 17:09
  • @supercat By the way, this question is about type punning for `int` array (not about the underlying object representation), and the standard clearly makes it undefined. – xskxzr Jun 11 '18 at 17:16
  • That other question didn't say anything about laundering. nor do the rules about indexing say anything special about character types. IMHO, the character-type exception is one of the worst things that ever happened to the C programming language (or its offshoot C++), since it needlessly blocks what should be useful optimizations even though there was never any need to treat character types differently from any other type that has no padding bits or trap representations. – supercat Jun 11 '18 at 17:31
  • @supercat That problem cannot be resolved with `launder`, the key is whether there is an underlying `char` array for object representation. [This answer](https://stackoverflow.com/a/47499092/5376789) shows the particularity of `char*`. – xskxzr Jun 11 '18 at 17:35
  • The act of converting an object's address to a `T*` should put a compiler on notice that the pointer will be used to accessed as a `T` or `T[]`, at least until the object is used via means not derived from that pointer or code enters a loop or function where that happens. Limiting that to character types is not necessary to facilitate useful optimizations, and treating characters as "fresh" forever would seldom be necessary to make programs work correctly (but does impede useful optimizations). – supercat Jun 11 '18 at 17:38
  • The pre-C++17 wording is valid only for the purposes of pointers comparison. _regardless of how the value was obtained_ does not mean that you can read some bits from `/dev/urandom` and use it as a pointer to some object if the bits happened to represent its address. – Language Lawyer Mar 26 '19 at 14:10
  • I'm not so convinced, because ** unrelated object **. What exactly is a **related object**? Is `a[1]` related to `a[0]` ? – MSalters Mar 26 '19 at 15:27
  • @LanguageLawyer I'm afraid this is valid in pre-C++17. Can you please show which paragraph forbids such use? – xskxzr Mar 26 '19 at 17:47
  • @MSalters It doesn't matter. The normative text already says "a pointer of type cv T* whose value is the address A is said to point to that object, regardless of how the value was obtained". In my example `p` happens to represent the address of `a[1][0]`, so it points to `a[1][0]`. – xskxzr Mar 26 '19 at 17:51
  • 1
    @xskxzr could a pre-C++17 compiler optimize `int f() { int i = 2; g(); return i; }` as `int f() { g(); return 2; }`? All reasonable implementations did this, even though `g` could guess the address of `i` and access it, because _regardless of how the value was obtained_. But _regardless of how the value was obtained_ was not meant to be interpreted literally. This was an unfortunate wording. – Language Lawyer Mar 26 '19 at 19:59
  • @xskxzr https://wandbox.org/permlink/L2M8lFWPAZQytr5S Clang with `-std=c++14` does not believe that the pointer becomes a pointer to the first element of the second row. Neither gcc believs. – Language Lawyer Mar 27 '19 at 09:54
  • @LanguageLawyer That may be because the compilers treat the change from C++14 to C++17 as a correction, so it does not keep the C++14 behavior any more. You can see [GCC 4.9.3 compiles it fine](https://wandbox.org/permlink/SI9Ipj9a3kTmg6SG) while [GCC 5.1.0 doesn't](https://wandbox.org/permlink/gnYNglJnduRqwNEs). Anyway, I have deleted the disputed part. – xskxzr Mar 27 '19 at 10:43
  • @xskxzr GCC prior to version 5 had terrible UB diagnostics in constant expression. The unfortunate wording was added to resolve [CWG73](http://wg21.link/cwg73), when the standard required padding between complete objects. The wording was changed, but not accurately, so unintendedly it became possible to "launder" a pointer past the end to point to the next row in a multidimensional array. The C++17 wording expresses the original intent correctly. There is no paragraph forbidding your usage, but there is knowledge of the history and the context. – Language Lawyer Mar 27 '19 at 10:55
  • @LanguageLawyer: Has the Standard added any clear and unambiguous way of getting a pointer that can be used to access the entire array? – supercat Mar 27 '19 at 15:37
  • @supercat `&a`. – Language Lawyer Mar 27 '19 at 17:07
  • @LanguageLawyer: As a simple example, how about code that wants to read or write all the numbers in an array, or perform some operation on all the items in an array? One could use nested loops, but even if a compiler were able to optimize that to be no worse than using a single loop, it would seem simpler to just write the code as a single loop in the first place. – supercat Mar 27 '19 at 17:09
  • @supercat in the modern world, people suffer too little. C++ is here to solve this problem. – Language Lawyer Mar 27 '19 at 19:44
  • @LanguageLawyer: In C++, is there any "defined" way of writing a function that can act upon all the elements of an arbitrarily-sized 2d array if it is passed a pointer to the array along with its dimensions? I don't think the rules about array bounds were intended to make that impossible, but when the Standard was first written, the authors didn't think it would matter if it actually defined behavior in such cases, since behaving in one obvious useful fashion would be easier for a compiler than doing anything else. Unfortunately, such technical debt has never been addressed. – supercat Mar 27 '19 at 21:07
  • @LanguageLawyer Thanks for your example, it is really helpful. I think I should add it to my answer. – xskxzr Apr 11 '19 at 08:15
6

It's up to interpretation. While the contiguity requirements of arrays don't leave much to the imagination in terms of how to layout a multidimensional arrays (this has been pointed out before), notice that when you're doing p[1234] you're indexing the 1234th element of the zeroth row of only 80 columns. Some interpret the only valid indices to be 0..79 (&p[80] being a special case).

Information from the C FAQ which is the collected wisdom of Usenet on matters relevant to C. (I do not think C and C++ differ on that matter and that this is very much relevant.)

Luc Danton
  • 34,649
  • 6
  • 70
  • 114
  • 1
    I think there is a signficiant difference between C and C++ here, as C99 includes the text in 6.5.6#8, "If the result points one past the last element of the array object, it shall not be used as the operand of a unary `*` operator that is evaluated". Therefore `&a[0][0] + 80` is not semantically identical to `&a[1][0]`, because the former shall not dereferenced via `*` but the latter can be. C++ does not have that text and (AFAICS) appears to leave unanswered the issue of when past-the-end iterators may be dereferenced. – M.M Jun 20 '14 at 00:33
  • 2
    In general knowing the layout isn't sufficient. Deriving the pointer safely is *also* required because of the possibility of segmented memory models. e.g. indexing one totally-separate array relative to another breaks easily, like `b[ a_minus_b + i ]`. Implementations can (for example on a segmented memory machine like x86) set a max size for any single object to be smaller than a segment, allowing pointer math to just work on the offset part of a seg:off address, and that can't "reach" `a` from `b`'s segment. [More about this reason for UB](https://stackoverflow.com/q/58322107). – Peter Cordes Dec 09 '19 at 00:02
2

In the language the Standard was written to describe, there would be no problem with invoking a function like:

void print_array(double *d, int rows, int cols)
{
  int r,c;
  for (r = 0; r < rows; r++)
  {
    printf("%4d: ", r);
    for (c = 0; c < cols; c++)
      printf("%10.4f ", d[r*cols+c]);
    printf("\n");
  }
}

on a double[10][4], or a double[50][40], or any other size, provided that the total number of elements in the array was less than rows*cols. Indeed, the guarantee that the row stride of a T[R][C] would equal C * sizeof (T) was designed among other things to make it possible to write code that could work with arbitrarily-sized multi-dimensional arrays.

On the other hand, the authors of the Standard recognized that when implementations are given something like:

double d[10][10];
double test(int i)
{
  d[1][0] = 1.0;
  d[0][i] = 2.0;
  return d[1][0]; 
}

allowing them to generate code that would assume that d[1][0] would still hold 1.0 when the return executes, or allowing them to generate code that would trap if i is greater than 10, would allow them to be more suitable for some purposes than requiring that they silently return 2.0 if invoked with i==10.

Nothing in the Standard makes any distinction between those scenarios. While it would have been possible for the Standard to have included rules that would say that the second example invokes UB if i >= 10 without affecting the first example (e.g. say that applying [N] to an array doesn't cause it to decay to a pointer, but instead yields the Nth element, which must exist in that array), the Standard instead relies upon the fact that implementations are allowed to behave in useful fashion even when not required to do so, and compiler writers should presumably be capable of recognizing situations like the first example when doing so would benefit their customers.

Since the Standard never sought to fully define everything that programmers would need to do with arrays, it should not be looked to for guidance as to what constructs quality implementations should support.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • The [C-faq](http://c-faq.com/aryptr/ary2dfunc2.html) from the accepted answer says your first example is not in strict conformance with the ANSI C Standard. – xskxzr Apr 11 '19 at 08:58
  • @xskxzr: I said *in the language the Standard was written to describe*. The authors of the Standard have never made any effort to exhaustively catalog and mandate support for all the constructs that existing implementations processed usefully, and that they expected future implementations to process likewise *whether mandated or not*. The only time they thought mandates would matter would be when a compiler writer reasonably judged that its customers could benefit from having it do something different, in which case allowing the other behavior would be better than forbidding it. – supercat Apr 11 '19 at 15:01
  • @xskxzr: The authors of the Standard recognized that given `int a[10][10]; ... a[i][j]=23;`, it would be useful for some purposes to allow compilers to squawk if `j` exceeds 9. That in no way implies that they did not *equally* recognize that it would be useful for compilers to usefully process code like the first example above. Fundamentally, they recognized that programmers (and by extension, compiler writers honoring the Spirit of C "trust the programmer") were better placed than the Committee to know when each approach would best help them "do what needs to be done". – supercat Apr 11 '19 at 15:23
-2

Your compiler will throw a bunch of warnings/errors because of subscript out of range (line 2) and incompatioble types (line 3), but as long as the actual variable (int in this case) is one of the intrinsic base-types this is save to do in C and C++. (If the variable is a class/struct it will probably still work in C, but in C++ all bets are off.)

Why you would want to do this.... For the 1st variant: If your code relies on this sort of messing about it will be error-prone and hard to maintain in the long run.

I can see a some use for the second variant when performance optimizing loops over 2D arrays by replacing them by a 1D pointer run over the data-space, but a good optimizing compiler will often do that by itself anyway. If the body of the loop is so big/complex the compiler can't optimize/replace the loop by a 1D run on it's own, the performance gain by doing it manually will most likely not be significant either.

Tonny
  • 671
  • 5
  • 17
  • 2
    There are situations (granted, not frequent) where you need to reshape your data as a different kind of array, not for performance reasons, but from the mathematical interpretation behind the algorithm. You can do that without casts, but using them may lead to code that is easier to read. This kind of casting is pretty close to what, e.g., MATLAB's `reshape` function does. – Christopher Creutzig Sep 01 '11 at 11:02
-3

The memory referenced by a is both a int[25][80] and a int[2000]. So says the Standard, 3.8p2:

[ Note: The lifetime of an array object starts as soon as storage with proper size and alignment is obtained, and its lifetime ends when the storage which the array occupies is reused or released. 12.6.2 describes the lifetime of base and member subobjects. — end note ]

a has a particular type, it is an lvalue of type int[25][80]. But p is just int*. It is not "int* pointing into a int[80]" or anything like that. So in fact, the int pointed to is an element of int[25][80] named a, and also an element of int[2000] occupying the same space.

Since p and p+1234 are both elements of the same int[2000] object, the pointer arithmetic is well-defined. And since p[1234] means *(p+1234), it too is well-defined.

The effect of this rule for array lifetime is that you can freely use pointer arithmetic to move through a complete object.


Since std::array got mentioned in the comments:

If one has std::array<std::array<int, 80>, 25> a; then there does not exist a std::array<int, 2000>. There does exist a int[2000]. I'm looking for anything that requires sizeof (std::array<T,N>) == sizeof (T[N]) (and == N * sizeof (T)). Absent that, you have to assume that there could be gaps which mess up traversal of nested std::array.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • 4
    The elements of an `int[25][80]` are `int[80]`'s .Not `int`. There is no `int[2000]` object. I'm not sure what the relevance of the text you quoted is supposed to be; an aggregate and its subobjects are all allocated and released simultaneously. – M.M Jun 20 '14 at 00:28
  • @MattMcNabb: How can you claim there is no `int[2000]` object, when the rules say that the lifetime has started? (Clearly "storage with proper size and alignment is obtained" has been satisfied) – Ben Voigt Jun 20 '14 at 00:34
  • There's no declaration or otherwise of an `int[2000]`. So there is no such object. (Let alone an object with whatever lifetime). Storage has been obtained for an `int[25][80]`. That's the same amount of storage as an `int[2000]` would take, but so what? `long[1000]` also has the same amount of storage (on some systems) – M.M Jun 20 '14 at 00:35
  • @Matt: So if I wrote `int* q = *reinterpret_cast(p);` would that make any difference? Now an `int[2000]` has been declared (and decayed right back to a pointer). Yes, `long[1000]` has the same amount of storage, but if you use `int[1000]` to access `long[1000]` you can't subscript it without violating strict-aliasing rules, because there are no `int` objects, only `long` objects. – Ben Voigt Jun 20 '14 at 00:41
  • @MattMcNabb: But the strict aliasing rule only applies to the `int` objects which are finally accessed by dereferencing the `int*`, and those `int` objects definitely do exist. Also, would it make a difference to you if I wrote `int* q = new (p) int[2000];`? But my Standard quote says placement-new (to run a constructor) isn't needed for arrays, they exist as soon as storage is obtained. – Ben Voigt Jun 20 '14 at 00:45
  • @MattMcNabb: Wrong on your last claim, see p7 of `[basic.life]`: "If, after the lifetime of an object has ended and before the storage which the object occupied is reused or released, a new object is created at the storage location which the original object occupied, a pointer that pointed to the original object, a reference that referred to the original object, or the name of the original object will automatically refer to the new object and, once the lifetime of the new object has started, can be used to manipulate the new object, if..." – Ben Voigt Jun 20 '14 at 00:59
  • @Matt: `p` has type `int*`, remember? And an `int` object most definitely occupies the exact same storage. Bullet 4 excludes base class subobjects, it does not exclude member subobjects or elements of arrays. – Ben Voigt Jun 20 '14 at 01:03
  • (Note, I originally took your `p` to mean the array, so my comments were not correct for `p` (although I stick by them for `a`!)). – M.M Jun 20 '14 at 01:16
  • Back to where we were. For clarity I'll rewrite: `int *p = &a[0][0]; int (*r)[2000] = reinterpret_cast(p); int *q = *r;` . The question is whether `p` can be used to iterate over all 2000 elements, and whether `q` can be used to iterate over all 2000 elements. – M.M Jun 20 '14 at 01:29
  • Strict aliasing does not seem to be relevant here; that only refers to *glvalue* , whereas decayed arrays are *prvalue*. So strict aliasing only applies to the `int` subobjects so there is no strict aliasing violation. – M.M Jun 20 '14 at 01:31
  • @Matt: your summary is perfect – Ben Voigt Jun 20 '14 at 02:10
  • @BenVoigt: Never use placement-new[] to initialize an array of some type. It could use some of the memory for storing an element-count, with sommewhat unfortunate consequences. – Deduplicator Feb 03 '15 at 17:07
  • @BenVoigt: Please see 5.3.4 expr.new §§13+14: `This overhead may be applied in all array new-expressions, including those referencing the library function operator new[](std::size_t, void*) and other placement allocation functions.` – Deduplicator Feb 03 '15 at 21:36
  • @Deduplicator Oh you're right I was just thinking of the requirements on the allocation function itself. Anyway my comment above was saying that placement new isn't being used and does not need to be. – Ben Voigt Feb 03 '15 at 21:59
  • @Deduplicator: That was [a bug](http://open-std.org/JTC1/SC22/WG21/docs/cwg_defects.html#2382) that has since been fixed. – Davis Herring Feb 09 '21 at 04:46
-3

You're free to reinterpret the memory any way you'd like. As long as the multiple does not exceed the linear memory. You can even move a to 12, 40 and use negative indexes.

Tavison
  • 1,523
  • 1
  • 11
  • 19