Accessing bytes of an object in C

Question

Unfortunately, I haven't found anything like std-discussion for the ISO C standard, so I'll ask here.

Before answering the question, make sure you are familiar with the idea of pointer provenance (see DR260 and/or http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm).

6.3.2.3 (Pointers), paragraph 7, says:

When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

The questions are:

1. What does "a pointer to an object" mean? Does it mean that, to get a pointer to the lowest addressed byte of an object of type T, we must convert a pointer of type cv T* to a pointer to a character type, and that converting a pointer to void obtained from the T* will not give the desired result? Or is "a pointer to an object" a property of the pointer's value that follows from pointer provenance, so that a cast to void* does not change the value (analogous to how it was recently formalized in C++17)?
2. Why does the paragraph explicitly mention increments? Does it mean that adding a value greater than 1 is undefined? Does it mean that decrementing (after incrementing the result several times, so that we do not go below the lower object boundary) is undefined? In short: is the sequence of bytes composing an object an array?
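
To make the first question concrete, here is a minimal sketch of the two conversion paths being compared. The helper names (`lowest_byte_direct`, `lowest_byte_via_void`, `same_first_byte`) are hypothetical and exist only for illustration. The comparison shows what a given implementation does; it cannot by itself settle whether the `void *` path is covered by the wording of 6.3.2.3p7.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert an int* directly to a character pointer. */
static const unsigned char *lowest_byte_direct(const int *obj)
{
    return (const unsigned char *)obj;
}

/* Hypothetical helper: the same conversion with an intermediate void*. */
static const unsigned char *lowest_byte_via_void(const int *obj)
{
    const void *v = obj;
    return (const unsigned char *)v;
}

/* Returns 1 if both paths yield pointers that compare equal. */
static int same_first_byte(const int *obj)
{
    assert(obj != NULL);
    return lowest_byte_direct(obj) == lowest_byte_via_void(obj);
}
```

Calling `same_first_byte(&value)` for some `int value` is expected to return 1 on mainstream implementations, which is the behavior the question asks the standard to justify.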

"a pointer to an object" --> a pointer to some non-function, non-`void` . — chux - Reinstate Monica, Aug 16 '18 at 00:08
"cast to void* does not change the value" is almost correct. The _encoding_ may change, yet it will equate to the original. Akin to `10 == 10.0` but they have different encodings. Pointers have different properties than integers. There are many questions here - at least 7. Perhaps reduce it. — chux - Reinstate Monica, Aug 16 '18 at 00:14
While a reasonable person *might* question the validity of "increment by -1" (rather than "decrement by 1"), I think it's safe to say that you can increment by any positive number. `i+=5;//a stupid but accurate comment here is 'increment by 5'`. Regarding your very last question, are you specifically wanting to know if the bytes are contiguous and CAN be treated as a C-array, or are you driving at something different? — zzxyz, Aug 16 '18 at 00:39
@zzxyz I know that they are contiguous. The question is are they an array or not. — A language lawyer, Aug 16 '18 at 00:42
@zzxyz I don't think that you can increment by any positive number. The standard defines increment in 6.5.2.4 (postfix) and in 6.5.3.1 (prefix). I believe it is not correct to invent your own meaning for terms defined in the standard. — A language lawyer, Aug 16 '18 at 00:49
@LeoTolstoy - I would say they are defining the meanings of operators rather than explicitly defining a new meaning for the word `increment`, but it's perhaps arguable. (because they do use the word with an *implied* value of 1) — zzxyz, Aug 16 '18 at 00:56
The Standard talks about _array types_ and _array objects_, where an array object is the representation of an array type value in memory. It seems pretty clear that a character pointer to an object is a pointer to an array object, and this is the focus of the discussion of pointer arithmetic in [§6.5.6](https://port70.net/~nsz/c/c11/n1570.html#6.5.6), so it should be fine to add values larger than 1 to the pointer, so long as bounds of the array object are respected. — ad absurdum, Aug 16 '18 at 02:35
Regarding 2, my interpretation is that the C standard uses this language about pointer increments when describing accesses to the bytes of an object merely as a matter of drafting convenience, avoiding the labor of writing a formal description, and that the intent is for the bytes to act as an array, as implied by C 2018 6.2.6.1 2 (objects are composed of sequences of bytes). — Eric Postpischil, Aug 16 '18 at 04:49
Regarding 1, I do not see the provenance issues to point to as being relevant. Certainly a `void *` derived from `T *` has the necessary provenance. There is a question whether the converted `void *` qualifies as a “pointer to an object” for the purposes of 6.3.2.3 7, but that is a question of language in 6.3.2.3 7, not of pointer provenance. — Eric Postpischil, Aug 16 '18 at 04:50
@DavidBowling so when we create an object with an automatic or static storage duration, we actually create more than one object, which overlap each other exactly? One object with the declared type and 3 objects with 'array of cv(declared type) {,signed,unsigned} char of size sizeof(declared type)' types? What is the difference between objects and the storage they occupy, then? — A language lawyer, Aug 16 '18 at 08:17
when the standard says "doing X results in Y", it does not imply "not doing X results in not-Y". This seems to answer all of your questions — M.M, Aug 16 '18 at 08:20
@EricPostpischil 2) is any sequence of objects of the same type automatically an array object? What about subsequences? Multidimensional arrays? 1) I'm familiar with the C++17 formalization of a pointer value (where a cast to `void*` does not change the pointer value, and the pointer value can be "pointer to an object"), so I'm looking at pointer provenance from this angle. It does not matter how provenance is going to be formalized in C. But for me "`void*` derived from `T*` has the necessary provenance"="cast to `void*` doesn't change the value, it is still pointer to an object" — A language lawyer, Aug 16 '18 at 08:32
@M.M If I understand you correctly, you are trying to say that the omission of an explicit definition of behavior does not make it undefined. Well, the standard explicitly states the opposite. — A language lawyer, Aug 16 '18 at 08:36
@Alanguagelawyer you don't understand me correctly. It seems to me you are being argumentative for the sake of it; this is all basic and obvious stuff and you are creating implications that aren't there — M.M, Aug 16 '18 at 08:37
Pointer arithmetic is defined by the pointer arithmetic section, it is not "undefined by omission" if some other part of the standard also mentions pointer arithmetic without copy-pasting the whole pointer arithmetic section — M.M, Aug 16 '18 at 08:39
@M.M some time ago I was also thinking that everything is all basic and obvious. Until I've read committee responses to a couple of defect reports such as DR260. — A language lawyer, Aug 16 '18 at 08:43
@M.M the problem is that the standard for some reason hesitates to state explicitly that the bytes of storage underlying an object compose an array object. — A language lawyer, Aug 16 '18 at 08:46
I VTC as "too broad", there are 7 questions in the "question" plus a pile more questions and argumentation in the comments. The format of this site is a single question that is answerable; it's not a discussion forum or an interactive tutorial. — M.M, Aug 16 '18 at 08:56
@EricPostpischil: The provenance issue could be relevant in some issues involving nested arrays. If a pointer to a multi-dimensional array of a character type is converted to a character pointer, an obtuse compiler could allow for a pointer to be successively incremented through the entire array while treating other indexing operations on the pointer as indexing operations on the first row of the array. — supercat, Aug 16 '18 at 16:28
@Alanguagelawyer: In the language designed by Dennis Ritchie, any action which allocates a region of storage would simultaneously create all possible objects whose size and alignment would let them fit in that storage. Changing any byte within that region could change the values of all objects containing that byte. In the absence of the "aliasing rules", or if the rules were tweaked to say "An object which is accessed *during a particular execution of a function or loop* using an lvalue derived from a particular type within said execution shall have its stored value... — supercat, Aug 16 '18 at 16:32
...accessed (within said execution) only by an lvalue which is of, or is freshly derived from, one of the following types". There's no reason the "aliasing rules" should concern themselves with objects that aren't being used, but the fact that an object isn't being used in a particular context doesn't mean it doesn't exist. — supercat, Aug 16 '18 at 16:37
@M.M: No, this is my only account, for better or worse. I do not agree this is basic and obvious stuff; the C standard is incomplete from a formal perspective, and I think this question asks valid questions. Which may be fine; the standard has served admirably to advance the industry from the time when it was created. And questions like this over the years have helped me appreciate (a) that incompleteness and lack of formality in the standard, (b) the spirit and intentions of the standard, and (c) how to understand the standard on different levels and how to do better engineering. — Eric Postpischil, Aug 16 '18 at 16:53
@EricPostpischil: The Standard serves pretty well in the hands of compiler vendors who recognize what the authors were trying to do. It focuses on things that people seeking to write useful implementations might not otherwise do (e.g. allow aliasing between seemingly-unrelated pointers to signed and unsigned types) and glosses over things they expected such people to do with or without a mandate. They didn't think anyone would care about whether a deliberately-less-than-ideal-quality implementation would or would not be conforming, because they didn't expect anyone to try to write one. — supercat, Aug 16 '18 at 17:07
@Alanguagelawyer: When the C89 Standard was written, one of the things that made C useful was that most implementations would handle most forms of UB "in a documented fashion characteristic of the environment" in cases where the environment documented a behavior that was sometimes useful, and when doing so was simpler than doing anything else. There was at the time no need for the Standard to concern itself with when quality implementations should be expected to behave in such fashion, because it would usually be pretty obvious to anyone familiar with the environment in question. — supercat, Aug 16 '18 at 17:13
Because of improvements in other languages, the tasks for which C remains the language of choice are primarily those which benefit from "environmental behaviors". Because of developments in compiler design, however, it's no longer clear when quality compilers should be expected to expose those natural platform behaviors to the programmer. IMHO, the Standard should define a category of implementations where all but two categories of actions would, at worst, invoke an unspecified choice from among an implementation-defined set of possibilities that may involve the environment. — supercat, Aug 16 '18 at 17:19
If a program writes to storage which is "owned" by the compiler [as opposed to the program or the environment], all bets are off. Likewise if the compiler documents requirements for the environment (e.g. the DF flag must be left clear) and something causes that requirement to be violated, or if an implementation specifies that an action will do something to the environment and the environment doesn't document the effect thereof. If, however, the environment *does* document the effects of all the things an implementation might do, a program should behave consistent with that. — supercat, Aug 16 '18 at 17:23
What's interesting to me about this being laid out like this is that it does seem to go out of its way to **not** say that a `char *` will behave like we all expect it to. Either because objects are not guaranteed to be contiguous in application memory or because `(charPtr +1) - charPtr` is not guaranteed to equal 1? — zzxyz, Aug 16 '18 at 18:52
@zzxyz: The authors of the Standard were trying to describe a language that already existed and was in wide use, and generally tried to minimize redundancy. If the Standard had merely said that converting a `T*` to a `char*` will yield something that will behave as a `char[]`, but hadn't specified the relationship between the elements thereof and the bytes of the original object, some implementations might have done something unusual. For example, on some 8-bit platforms the most efficient way to store something that would be usable as an `int[10]` would be to... — supercat, Aug 16 '18 at 19:33
...use ten bytes to hold the upper half of each element and ten bytes to store the lower half. If the requirement were merely that converting an `int[10]` to a `char*` must yield something that's usable as a `char[]`, such an approach would have been allowable for an `int[10]` whose address would never be passed to code that would treat it as an `int[]` of unknown size, even if that code used it as a `char*`. On the other hand, it would have been hard for an implementation to honor the sequencing requirement without also allowing indexing. — supercat, Aug 16 '18 at 19:36
I've been told that `(int)(charPtr+1) - (int)charPtr == 1` is not (or at least *was* not) a safe assumption across all architectures and OSes. Drum memory, extremely small memory areas (leading to extreme fragmentation), and lack of OS support for virtual/application memory, were all cited as reasons this could be the case. If a compiler can allocate a 500 byte structure in a non-contiguous memory area, or fail to allocate the memory, it is obviously reasonable for it to split the data if this is a common occurrence. I was unable to verify this, though. — zzxyz, Aug 16 '18 at 22:21
"Increment" & "decrement" are not operators on variables here, these words are being used in an everyday sense. The standard is talking about consecutive addresses of bytes in the memory of the abstract machine. PS You don't seem to understand the C object model. — philipxy, Aug 18 '18 at 08:55
@philipxy it is not easy to understand something that is much underspecified — A language lawyer, Aug 18 '18 at 11:28

supercat · Answer 1 · 2018-08-16T19:04:33.800

The general description of pointer addition suggests that for any values/types of pointer ptr and signed integers x and y where ((ptr+x)+y) and (x+y) are both defined by the Standard, (ptr+(x+y)) would behave equivalently to ((ptr+x)+y). While it might be possible to argue that the Standard doesn't explicitly say that incrementing a pointer five times would be equivalent to adding 5, there is nothing in the Standard that would suggest that quality implementations should not be expected to behave in that fashion.
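
As a rough illustration of that equivalence, the sketch below compares the two groupings on an ordinary `int` array, where 6.5.6 defines the arithmetic without dispute. The function names are hypothetical, and the caller is assumed to keep every intermediate pointer inside the array or one past its end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if stepping by x and then by y lands on the
   same element as a single step of x + y. The caller must keep ptr + x and
   ptr + x + y inside the array (or one past its end). */
static int stepwise_equals_combined(const int *ptr, ptrdiff_t x, ptrdiff_t y)
{
    assert(ptr != NULL);
    const int *stepwise = (ptr + x) + y;
    const int *combined = ptr + (x + y);
    return stepwise == combined;
}

/* Hypothetical helper: returns 1 if n successive increments equal adding n.
   The caller must ensure ptr + n stays within the same array. */
static int increments_equal_addition(const int *ptr, size_t n)
{
    assert(ptr != NULL);
    const int *p = ptr;
    for (size_t i = 0; i < n; i++)
        p++;
    return p == ptr + n;
}
```

For an array `int a[8]`, both `stepwise_equals_combined(a, 2, 3)` and `increments_equal_addition(a, 5)` return 1. The open question in this thread is whether the same reasoning is formally guaranteed for the character pointer obtained under 6.3.2.3p7.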

Note that the authors of the Standard didn't try to make it "language-lawyer-proof". Further, they didn't think anyone would care about whether or not an obviously-inferior implementation was "conforming". An implementation that only works reliably if bytes of an object are accessed sequentially would be less versatile than one which supported reliable indexing, while offering no plausible advantage. Consequently, there should be no need for the Standard to mandate support for indexing, because anyone trying to produce a quality implementation would support it whether the Standard mandated it or not.

Of course, there are some constructs which programmers in the 1990s--or even the authors of the Standard themselves--expected quality compilers to handle reliably, but which some of today's "clever" compilers don't. Whether that means such expectations were unreasonable, or whether they remain accurate when applied to quality compilers, is a matter of opinion. In this particular case, I think the implication that positive indexing should behave like repeated incrementing is strong enough that I wouldn't expect compiler writers to argue otherwise, but I'm not 100% certain that no compiler would ever be "clever"/obtuse enough to look at something like:

int test(unsigned char foo[5][5], int x)
{
  foo[1][0] = 1;

  // Following should yield a pointer that can be used to access the entire
  // array 'foo', but an obtuse compiler writer could perhaps argue that the
  // code is converting the address of foo[0] into a pointer to the first
  // element of that sub-array, and that the resulting pointer is thus only
  // usable to access items within that sub-array.

  unsigned char *p = (unsigned char*)foo;

  // Following should be able to access any element of the object [i.e. foo]
  // whose address was taken

  p[x] = 2;

  return foo[1][0];
}

and decide that it could skip the second read of foo[1][0] since p[x] wouldn't possibly access any element of foo beyond the first row. I would, however, say that programmers should not try to code around the possibility of vandals writing a compiler that would behave that way. No program can be made bullet-proof against vandals writing obtuse-but-"conforming" compilers, and the fact that a program can be undermined by such vandals should hardly be viewed as a defect.

Great answer. There's a recent related question (for C++) here. Biggest difference in that case being that the behavior is explicitly called out as undefined: https://stackoverflow.com/questions/51623643/how-can-deleting-a-void-pointer-do-anything-other-than-invoke-the-global-delete — zzxyz, Aug 16 '18 at 18:11
@zzxyz: I don't see what the question about `delete` has to do with accessing the bytes of an object? Did you copy the wrong link? — supercat, Aug 16 '18 at 19:02
No I didn't, but I should've clarified what I see as the similarity, which is this: The point at which common sense tells you the compiler should be treating a pointer as "just an integer" (or perhaps "just a `void*`") is not guaranteed. Some of the comments and answers specifically call that out (regardless of the non-existence of custom destructors). — zzxyz, Aug 16 '18 at 19:12
@zzxyz: The C89 Standard would have been a fair bit bigger if the authors had tried to enumerate all behavioral guarantees a quality compiler should uphold, including those where (1) all existing compilers upheld them; (2) the easiest way of upholding other parts of the Standard would also uphold the guarantees in question. The authors didn't think anyone would care whether a compiler that behaved in deliberately-less-than-ideal fashion would be "conforming", since they didn't think anyone would try to use the Standard to justify their behavior. — supercat, Aug 16 '18 at 19:55

score 0 · Answer 2 · answered Aug 16 '18 at 00:58


Take a non-char C object and create a pointer to it, i.e.

int obj;
int *objPtr = &obj;

Convert the pointer to the object into a pointer to char:

char *charPtr = (char *)objPtr;

Now charPtr points to the lowest addressed byte of the int obj. Increment it:

charPtr++;

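Now it points to the next byte of the object, and so on until you reach the size of the object. To visit every byte, start again from the first one:

size_t i;
charPtr = (char *)objPtr;  /* reset: charPtr was already incremented above */
for (i = 0; i < sizeof(obj); i++)
    printf("%d ", *charPtr++);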
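
For comparison, here is a hedged sketch that reads the bytes of an object three ways (successive increments, indexing, and walking backwards) and checks each against a `memcpy` snapshot. The helper name `byte_views_agree` and the 64-byte limit are assumptions made for this example. It demonstrates what an implementation does in practice, not what the wording of 6.3.2.3p7 formally guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reads the n bytes of an object three ways and
   checks each against a memcpy snapshot. Returns 1 if all agree.
   Assumes n <= 64 to keep the sketch free of dynamic allocation. */
static int byte_views_agree(const void *obj, size_t n)
{
    unsigned char snapshot[64];
    assert(obj != NULL && n <= sizeof snapshot);
    memcpy(snapshot, obj, n);

    const unsigned char *base = obj;

    /* 1. Successive increments, as 6.3.2.3p7 words it. */
    const unsigned char *p = base;
    for (size_t i = 0; i < n; i++) {
        if (*p++ != snapshot[i])
            return 0;
    }

    /* 2. Indexing (base[i] is *(base + i)), with no increments at all. */
    for (size_t i = 0; i < n; i++) {
        if (base[i] != snapshot[i])
            return 0;
    }

    /* 3. Walking back down from one past the last byte. */
    p = base + n;
    for (size_t i = n; i > 0; i--) {
        if (*--p != snapshot[i - 1])
            return 0;
    }
    return 1;
}
```

A call such as `byte_views_agree(&obj, sizeof obj)` returns 1 when all three traversals see the same bytes as `memcpy`, which is what mainstream compilers produce.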

answered Aug 16 '18 at 00:58

Serge


If you change to `char *charPtr = (char *)((void*)objPtr);`, where will `charPtr` point? Why? – A language lawyer Aug 16 '18 at 01:02

This is not an answer to any of the questions. – A language lawyer Aug 16 '18 at 01:21
This question is not about accessing the bytes that represent an object. It is asking about specific details in the language of the C standard. For example, if you convert a pointer to an object to a pointer to void, is it still a “pointer to an object”? That is, is the `void *`, for the purposes of C 2018 6.3.2.3 7, a pointer to the same object as the original pointer, regardless of the type? (We are not asking if it can be dereferenced while it is a `void *`, just whether it still qualifies for the conversion semantics specified in 6.3.2.3 7.) – Eric Postpischil Aug 16 '18 at 04:32
To be more specific, given the `charPtr` initialized above, the standard clearly says we can access the bytes repeatedly via `*charPtr++`. But does it say we can access a byte via `charPtr[2]`? There are no successive increments in that, so the language in 6.3.2.3 7 does not say so explicitly. That is one of the things the question is getting at. – Eric Postpischil Aug 16 '18 at 04:53
