The C standard specifies:

A pointer to void shall have the same representation and alignment requirements as a pointer to a character type. Similarly, pointers to qualified or unqualified versions of compatible types shall have the same representation and alignment requirements. All pointers to structure types shall have the same representation and alignment requirements as each other. All pointers to union types shall have the same representation and alignment requirements as each other. Pointers to other types need not have the same representation or alignment requirements.

i.e. sizeof(int*) is not necessarily equal to sizeof(char*) - but sizeof(struct A*) is necessarily equal to sizeof(struct B*).
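
A minimal sketch of what the quoted paragraph does and does not promise. The types `struct A`, `struct B`, `union U`, `union V` and both function names are made up for illustration, and the checks compare sizes only (equal size follows from "same representation").

```c
/* Hypothetical types with very different sizes and alignments. */
struct A { char c; };
struct B { double d[8]; };
union U { char c; };
union V { long double ld; };

/* Covered by the quoted paragraph: expected to return 1 on any
   conforming implementation. */
int guaranteed_pointer_sizes_match(void)
{
    return sizeof(struct A *) == sizeof(struct B *)
        && sizeof(union U *) == sizeof(union V *)
        && sizeof(void *) == sizeof(char *)
        && sizeof(const int *) == sizeof(int *);
}

/* Not covered: an implementation is allowed to return 0 here,
   although mainstream desktop platforms return 1. */
int int_and_char_pointer_sizes_match(void)
{
    return sizeof(int *) == sizeof(char *);
}
```

Note that nothing in the paragraph relates struct pointers to union pointers, so that comparison is deliberately left out of the first function.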

What is the rationale behind this requirement? As I understand it the rationale behind differing sizes for basic types is to support use cases like near/far/huge pointers (edit: as was pointed out in comments and in the accepted answer, this is not the rationale) - but doesn't this same rationale apply to structs in different locations in memory?

Daniel Kleinstein
  • I believe it is because of *opaque* pointers. The compiler needs to allocate a pointer without knowing the actual `struct` internals. – Eugene Sh. Aug 09 '21 at 20:16
  • It would be impossible to write type-erased interfaces without this, and forward declarations of structs would be moot, severely constraining the programmer. – SergeyA Aug 09 '21 at 20:19
  • @SergeyA Why would it be impossible to write type-erased interfaces? Functions like `qsort` work just fine on types that don't have the same-pointer-size restriction. – Daniel Kleinstein Aug 09 '21 at 20:29
  • You can use a pointer to struct (or union) before the relevant struct (or union) is defined. `struct node { struct node *next; /* struct node undefined here, but next is ok */ ... };` – pmg Aug 09 '21 at 20:34
  • @SergeyA type-erasure only requires a valid conversion to exist – Ajay Brahmakshatriya Aug 09 '21 at 20:35
  • The rationale for differing sizes is not near/far pointers: those require an extension to standard C anyway. It's pointers to words vs pointers to bytes on architectures where individual bytes can't be accessed directly, only words, so a pointer to a byte is a word pointer plus some extra information indicating which part of the word the pointer points to. (Also data and code pointers can have different sizes, which is a little less exotic.) – Gilles 'SO- stop being evil' Aug 09 '21 at 20:38

2 Answers

The answer is very simple: struct and union types can be declared as opaque types, i.e. without an actual definition of the struct or union details. If the representation of pointers differed depending on the structures' details, how would the compiler determine what representation to use for opaque pointers appearing as arguments or return values, or even when just reading them from or storing them to memory?

The natural consequence of the ability to manipulate opaque pointer types is that all such pointers must have the same representation. Note however that pointers to struct and pointers to union may have different representations from each other, as may pointers to basic types such as char, int, double...
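
The opaque-type argument can be sketched in a single file as follows. The `counter` API and `advance_twice` are hypothetical names invented for this illustration; in real code the three parts would live in a header, a client file and an implementation file.

```c
#include <stdlib.h>

/* What a header would expose: the tag is declared, the members are not. */
struct counter;
struct counter *counter_new(int start);
int counter_next(struct counter *c);
void counter_free(struct counter *c);

/* Client code: compiled while struct counter is still incomplete. */
int advance_twice(struct counter *c)
{
    struct counter *alias = c;  /* plain pointer copy, no layout needed */
    counter_next(alias);
    return counter_next(c);
}

/* What the implementation file would contain. */
struct counter { int value; };

struct counter *counter_new(int start)
{
    struct counter *c = malloc(sizeof *c);
    if (c != NULL)
        c->value = start;
    return c;
}

int counter_next(struct counter *c) { return ++c->value; }

void counter_free(struct counter *c) { free(c); }
```

When `advance_twice` is compiled, the only thing known about `struct counter` is that it is a struct, yet the compiler still has to pass, copy and return pointers to it. That only works if the pointer representation does not depend on the members.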

Another distinction regarding pointer representation is between pointers to data and pointers to functions, which may have different sizes. Such a difference is more common in current architectures, albeit still rare outside operating system and device driver space. 64 bits for function pointers seems a waste, as 4GB should be amply sufficient for code space, but modern architectures take advantage of this extra space to store pointer signatures that harden code against malicious attacks. Another use is to take advantage of hardware that ignores some of the pointer bits (e.g. AArch64 with Top Byte Ignore enabled ignores the top 8 bits; plain x86_64 instead requires the upper 16 bits to be a sign extension of bit 47, so software must mask them before dereferencing) to store type information or to use NaN values unmodified as pointers.
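
As a rough illustration of storing type information in spare pointer bits, here is a sketch that uses a different and more portable trick than the high-bit schemes mentioned in this answer: it puts a 2-bit tag in the low bits that alignment leaves at zero. The names `tagged_ptr`, `tagged_make`, `tagged_tag` and `tagged_pointer` are hypothetical, and the code assumes that `uintptr_t` exists, that `int` objects are at least 4-byte aligned, and that pointer-to-integer conversion preserves those zero low bits (true on mainstream platforms, not promised by the Standard).

```c
#include <stdint.h>

typedef uintptr_t tagged_ptr;     /* hypothetical name for this sketch */
#define TAG_MASK ((uintptr_t)3)   /* assumes int is at least 4-byte aligned */

/* Pack a pointer and a 2-bit tag into one integer. */
tagged_ptr tagged_make(int *p, unsigned tag)
{
    return (uintptr_t)(void *)p | ((uintptr_t)tag & TAG_MASK);
}

/* Recover the tag. */
unsigned tagged_tag(tagged_ptr t)
{
    return (unsigned)(t & TAG_MASK);
}

/* Recover the original pointer by clearing the tag bits first. */
int *tagged_pointer(tagged_ptr t)
{
    return (int *)(void *)(t & ~TAG_MASK);
}
```

The tagged value is kept as an integer and only converted back to a pointer after the tag has been masked off, so no misaligned `int *` is ever formed.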

Furthermore, the near/far/huge pointer attributes from legacy 16-bit code were not correctly addressed by this remark in the C Standard, as all pointers could be near, far or huge. Yet the distinction between code pointers and data pointers in mixed model code was covered by it and still seems current on some OSes.

Finally, POSIX mandates that all function pointer types have the same representation as void *, so mixed model code should quickly become a historical curiosity.

It is arguable that architectures where the representation is different for different data types are vanishingly rare nowadays and that it may be high time to clean up the standard and remove this option. The main objection is support for architectures where the addressable units are large words and 8-bit bytes are addressed using extra information, making char * and void * larger than regular pointers. Yet such architectures make pointer arithmetic very cumbersome and are quite rare too (I personally have never seen one).
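
To make the word-addressed case concrete, here is a small simulation rather than code for any real machine. Everything in it is an assumption made for illustration: 32-bit words, byte 0 being the least significant byte of a word, and the names `word_ptr`, `byte_ptr`, `load_byte` and `byte_ptr_add`.

```c
#include <stddef.h>
#include <stdint.h>

/* On the simulated machine an address selects a whole 32-bit word. */
typedef struct { size_t word_index; } word_ptr;

/* A byte pointer needs the word address plus a position inside the word. */
typedef struct { size_t word_index; unsigned byte_in_word; } byte_ptr;

/* Load one byte: fetch the word, then shift the wanted byte down. */
uint8_t load_byte(const uint32_t *memory, byte_ptr p)
{
    return (uint8_t)(memory[p.word_index] >> (8u * p.byte_in_word));
}

/* Byte pointer arithmetic has to carry from the offset into the word index. */
byte_ptr byte_ptr_add(byte_ptr p, size_t n)
{
    size_t total = p.byte_in_word + n;
    p.word_index += total / 4;
    p.byte_in_word = (unsigned)(total % 4);
    return p;
}
```

The extra field is why `char *` and `void *` would be larger than `int *` on such a machine, and the carry in `byte_ptr_add` is the cumbersome arithmetic referred to here.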

chqrlie
  • C does work just fine with tagged pointer types, where some bits of the pointer are used for type checking by the hardware. I've used that one a lot :) – Kuba hasn't forgotten Monica Aug 09 '21 at 20:56
  • 1) Your argument suggests that all struct pointers need be the same size. I think I can defeat it by constructing a counterexample but it would be a constrained counterexample exhibiting properties that no real machine would have. – Joshua Aug 09 '21 at 21:03
  • 2) There ain't no rule that all struct pointers require the same alignment. The actual rule is all struct pointers of the same size require the same alignment. I have much code that depends on the fact that `struct { char a; char b; }` has size 2 and requires only `char` alignment. – Joshua Aug 09 '21 at 21:04
  • 3) I have used a nonconforming compiler that had `sizeof(char *)` be 2 but `sizeof(const char *)` be 3. It was about as annoying as it sounds. – Joshua Aug 09 '21 at 21:06
  • @Joshua: That's fine, your `struct` probably has a size of 2 and alignment of 1, but a pointer to such a `struct` has the same size and alignment as any other `struct` pointer. – chqrlie Aug 09 '21 at 21:06
  • @chqrlie: Oh; you mean the alignment of the pointer not the struct. See point 1; I can only construct non-useful counterexamples to that. – Joshua Aug 09 '21 at 21:07
  • @Joshua Just curious. _What_ arch/OS/compiler was that? Did the processor dictate that? – Craig Estey Aug 09 '21 at 21:08
  • @CraigEstey: Some Microchip processor. RAM was only 32k but ROM was bigger than 64k and `const char *` could point into ROM. – Joshua Aug 09 '21 at 21:09
  • Function pointers that consist of a pointer to code and a pointer to an instance of global data are also fairly common, as it allows code pages to be shared between processes even if they are mapped to different addresses. – Simon Richter Aug 10 '21 at 09:42
  • isn't a handle a larger pointer, well, of course it can do more stuff than a normal pointer ever could, but basically it is just a pointer on X [ a hardware device, a driver etc. ] – clockw0rk Aug 10 '21 at 10:13
  • "Posix mandates ... should quickly become a historical curiosity." POSIX shows no sign of taking significant marketshare in the embedded space that accounts for the vast majority of devices programmed in C. – Ben Voigt Aug 10 '21 at 15:06
  • @CraigEstey: Some microcontrollers have multiple address spaces, some of which may be limited to 256 bytes, some to 65536 bytes, and some of which may be able to go larger. Further, they require the use of different instructions to access things in different address spaces. The HiTech C Compiler (and perhaps some others) will process reads through `const`-qualified pointers by checking part of the pointer to see which address space its target is in, and then use an appropriate instruction to access storage there. Generating such code for non-const accesses would grossly degrade efficiency. – supercat Aug 10 '21 at 17:04
  • @supercat Yes, I was aware of something like the Intel 8051 when I asked the question. It must generate different instructions based on which memory bank is being addressed. So, the type of the pointer must be retained. But, naively, I assume the size of a pointer is distinct from the address range it may point to. That is (e.g.), we have a pointer to a 256 byte rom or a 64KB ram [or vice versa]: `rom *romptr; ram *ramptr;`. They point to different length memory segments/banks. But, I would expect: `sizeof(romptr) == sizeof(ramptr)`. AFAICT, on 8051 the bank addresses overlap? – Craig Estey Aug 10 '21 at 19:14
  • @CraigEstey: On the 8051, an IDATA pointer can access storage in IDATA, DATA, or BDATA. Other address ranges are considered distinct, but depending upon external hardware may or may not overlap (the compilers I've seen are agnostic to such possibilities). A "universal" pointer is three bytes, and uses a byte to identify an address space and one or two bytes to identify an address within that range. Some people regard implementations that need such qualifiers as inferior, but if I need to target an 8051-based piece of hardware, I'd much rather have such qualifiers available than not. – supercat Aug 10 '21 at 19:27
  • @BenVoigt I think, in the very near future, no-one will be using _true_ microcontrollers anymore; instead every embedded device will be an IoT box attached to a cheap $5 "1080p" panel from Alibaba running a GUI made in Electron - because, frankly, it'll be far cheaper than spending dev-hours fiddling with ROM burners and the like – Dai Aug 10 '21 at 19:28
  • @CraigEstey: Also, on compilers for MS-DOS, if one was using "small" or "medium" model, most pointers were limited to accessing a 64K region of memory, but "far"-qualified pointers could access anything. Having to use "far" qualifiers for things that could access things that were "far-allocated" was a slight nuisance, but having most pointer operations default to "near" made them more than twice as fast to process as "far" pointer accesses would have been. – supercat Aug 10 '21 at 19:29
  • @Dai: For mains-powered devices, that may become a more common paradigm, but many applications are at least somewhat sensitive to battery weight and run time. – supercat Aug 10 '21 at 19:31
  • @supercat Is the universal pointer part of the H/W arch? Or, a concept in the compiler (e.g. `near/far/huge` pointers for 8086)? For IBM/370, registers are 32 bits but all instructions ignored the upper 8 bits when the register was used as an address (i.e. a 24 bit address range). So, S/W used the upper byte as a segment/overlay number. Because the H/W ignored it, the upper byte did _not_ need to be masked--and this was used heavily in the S/W circa 1970+ – Craig Estey Aug 10 '21 at 19:32
  • @Dai: I don't think you have any clue what "embedded" really means. Did you mean to say that Electron is going inside the display panel, or inside the "IoT box"? I promise you that whatever is responsible for real-time control of motors and sensors is never going to be built on internet technology. Non-deterministic timing simply is not acceptable at the edge control system. – Ben Voigt Aug 10 '21 at 19:38
  • @BenVoigt My comment reply was facetious - alluding to the sad state of affairs we find ourselves in - have you seen "Etcher", for example? It's a 140MB Electron wrapper over `dd` (~80KB) – Dai Aug 10 '21 at 19:58
  • @Dai: I haven't seen that but it doesn't surprise me in the least. The vast majority of desktop developers are very wasteful; web developers even moreso. – Ben Voigt Aug 10 '21 at 20:01

In the C language invented by Dennis Ritchie, when a C compiler encountered a definition for struct foo *p; it would have no need to care about whether or how the structure was defined unless or until a program used pointer arithmetic or the -> operator. Otherwise, it could simply record that p was a pointer to a structure with tag foo without having to know or care about if, where, or how such a structure might be defined. The Standard adds an odd little wrinkle which sometimes makes structure pointers with matching tags incompatible, but the issue remains that a compiler must be able to process a declaration of a pointer-to-structure type, as well as basic assignments between such pointers, in cases where it might not know the contents of a structure.
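
A short sketch of that point: everything before the struct definition below compiles with `struct foo` still incomplete. The names `swap_foo`, `foo_id`, `foo_pointer_size_unchanged` and `FOO_PTR_SIZE_BEFORE` are made up for this example.

```c
struct foo;  /* tag only; no members are known yet */

/* The pointer size is already a compile-time constant at this point. */
enum { FOO_PTR_SIZE_BEFORE = sizeof(struct foo *) };

/* Declaring, copying and assigning such pointers needs no definition. */
void swap_foo(struct foo **a, struct foo **b)
{
    struct foo *tmp = *a;
    *a = *b;
    *b = tmp;
}

struct foo { int id; };  /* the type is completed only here */

/* The -> operator is what actually requires the definition. */
int foo_id(const struct foo *p)
{
    return p->id;
}

/* Completing the type cannot change the pointer's size. */
int foo_pointer_size_unchanged(void)
{
    return FOO_PTR_SIZE_BEFORE == (int)sizeof(struct foo *);
}
```

If the size of `struct foo *` could depend on the members, the constant computed before the definition would have no well-defined value.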

Note that on platforms where pointers to objects with arbitrary alignment may be larger than pointers to objects that are known to have int alignment, a compiler might sensibly specify that all structures have int alignment even if they only contain character members. Further, compilers for such platforms might decide to process pointers to unions in such a way as to allow a pointer to any object--even a character--to be converted into a pointer to any union containing such an object, and used to access that object within the union. This may require that pointers to union objects be the size of a byte pointer, rather than a smaller int pointer.

Note that in pre-standard compilers, if two structures contained matching members, a function that accepted a void* and converted it into one structure type would have been expected to be usable to operate on both types interchangeably. Unfortunately, the Standard allows compilers to assume that code will never do such a thing, and provides no means for programmers to indicate when two structures should be usable interchangeably.
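
For contrast, here is a sketch of one arrangement that the Standard does define, and which comes up in the comments under this answer: the shared members are wrapped in a struct that is the first member of each type. It is not the pre-standard "matching members" idiom described in this paragraph, and the names `struct header`, `struct text_msg`, `struct num_msg` and `message_length` are hypothetical.

```c
/* Shared members, wrapped in their own struct. */
struct header { int kind; int length; };

/* Two otherwise unrelated types that both start with the header. */
struct text_msg { struct header hdr; char body[16]; };
struct num_msg  { struct header hdr; double value; };

/* Accepts a pointer to any struct whose FIRST member is a struct header.
   A pointer to a struct, suitably converted, points to its first member. */
int message_length(const void *msg)
{
    const struct header *h = msg;
    return h->length;
}
```

The cost is visible in the types themselves: every access to a shared member goes through `.hdr`, and every participating struct has to use the same `struct header` definition.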

supercat
  • I think the standard (even C89) does provide for that, for any structures which have the same initial members (if all members from start to the ones you wish to access in both structs are not common, you have not told the compiler to align those members properly): one, you can put the common members into a single struct definition, and use that struct as the first member of any other struct that should be usable interchangeably; two, you can put the structs into a union, and access the common initial members of each struct through either of those struct members of the union. – mtraceur Nov 29 '21 at 06:06
  • @mtraceur: Given declarations `struct {int x;} *p1; struct {int x;} *p2;` and code `p1->x = 1; p2->x=2; return p1->x;`, the aliasing optimization logic in both clang and gcc will (unless disabled with `-fno-strict-aliasing`) generate code that unconditionally returns 1. – supercat Nov 29 '21 at 15:40
  • We're going to need to get more specific. I'm not seeing the same result. When I compile a file containing just `int f(struct {int x;} *p1, struct {int x;} *p2) { p1->x = 1; p2->x = 2; return p1->x; }` with `clang -c -Os -std=c89 -pedantic` (same result with `-O3` instead of `-Os`, and with `c17` instead of `c89`), `f` turns into the following machine instructions on this ARMv8/AArch64 machine I'm on right now: `mov w8, #0x1`, `mov w9, #0x2`, `str w8, [x0]`, `str w9, [x1]`, `ldr w0, [x0]`, `ret`. – mtraceur Nov 29 '21 at 18:35
  • Same result on an x86-64 machine I just tried it on with Clang. However, I went to reproduce it on godbolt and on godbolt I see what you're reporting with latest GCC (but not with latest Clang). I am willing to believe that GCC is technically correct here, but this is still all somewhat beside the point, because that's not what I had in mind when I said the standard provides a way to do this. – mtraceur Nov 29 '21 at 18:57
  • What I had in mind is that the standard provides several ways to explicitly tell the compiler "I intend for these two struct types to have common layout/members (at least at the beginning) and I intend for this function to operate on any struct with this common layout". – mtraceur Nov 29 '21 at 19:03
  • The first way to do this is provided by the standard defining that you can legally cast a structure pointer to a pointer to its first member, and vice-versa. In this example where the only common member is an int, that means casting the pointers to int pointers, but the more general solution is wrapping all the common initial members in a common struct, then casting the struct pointers to pointers to that common first member struct. Doing that for both writes through the struct pointers makes it clear to the compiler that both writes can alias each other, and solves this problem on GCC too. – mtraceur Nov 29 '21 at 19:05
  • The second way is to write functions that are meant to operate on any struct with the same layout in terms of a pointer to the struct type which contains all those common members. Of course the key nuance here is that `p1` and `p2` in the example we've been trying are not representative of two pointers to a common initial struct! `struct {int x;};` and `struct {int x;};` are exactly the same, but I think GCC in my tests is interpreting those as two different struct types which just happen to have the same textual declaration, and I think this might be standard-compliant. – mtraceur Nov 29 '21 at 19:11
  • The standard-provided way to make it clear that two pointers to the same struct are actually pointing to the same type of struct is to give the struct type a tag name: when I change the code as follows, both GCC and Clang recognize the possible aliasing: `struct common { int x; }; int f(struct common * p1, struct common * p2) { p1->x = 1; p2->x = 2; return p1->x; }`. – mtraceur Nov 29 '21 at 19:13
  • So the second way also relies on the legality of the pointer cast from struct to its first member, but in the second way the pointer conversion happens at the place where the function is invoked rather than inside of the function. I think the second way is better because it thoroughly, end-to-end, makes *explicit* that you intend two or more structures to have a common initial layout, and that you intend your function to operate on any structure with that initial layout (it also gives that initial layout a distinct type+name in the C type system, which helps with the explicit intent showing). – mtraceur Nov 29 '21 at 19:26
  • And the third way, with unions, makes it possible even if no one has defined a common struct for the common initial sequence. If one header has `struct s1 {int x; float y; };`, and another header has `struct s2 { int x; char * z; };`, the standard provides a special case for structs with common initial sequences in unions, so you can do `union common {struct s1 s1; struct s2 s2; };` or even `union common { struct { int x; } common; struct s1 s1; struct s2 s2; };`, and regardless of which union member you assigned to, you can access .x through either union member. – mtraceur Nov 29 '21 at 19:51
  • Of course the third way requires that you assign the struct into a union. Obviously that doesn't help if you just have an address to an existing struct which is not already in a union, and you don't want the semantics of copying that struct into a union variable. I mostly mention it for completeness. The bigger point here is that these are all ways provided by the standard which do let you indicate that two different struct types should be usable interchangeably. – mtraceur Nov 29 '21 at 20:11
  • I guess clang regards structure types without tags as being equivalent if they contain identical members, but if you assign a tag to one structure type but not the other, clang will treat the structs as alias-incompatible. GCC will interpret even structure types without tags as alias-incompatible regardless of contents. – supercat Nov 29 '21 at 20:32
  • Right, but per everything else I said, I still disagree with "the Standard [...] provides no means for programmers to indicate when two structures should be usable interchangeably" - the standard provides ways to do so, as described in my other comments. – mtraceur Nov 29 '21 at 22:32
  • In pre-standard C, pointers to structure types that shared a Common Initial Sequence could be used interchangeably when reading members of the Common Initial Sequence, and generally when writing as well. While there are ways of writing functions that use explicit address computations or byte-level manipulations to act upon any blob of memory with a known format, the Standards don't provide any means of specifying that a function which is written *using struct member-access operators* should be able to operate interchangeably on all structures that share a CIS containing applicable members. – supercat Nov 29 '21 at 22:47
  • @mtraceur: Trying to encapsulate the CIS within its own structure only works if the length of the Common Initial Sequence is a multiple of its alignment requirement, clutters *all code* which uses the CIS with extra `.header.` member-accesses, and further requires that all code which uses the CIS use the same defined header type to do so. No such limitations applied to the pre-Standard "it just works" semantics. – supercat Nov 29 '21 at 22:50
  • All fair points (although the `.header.` can be eliminated in recent C versions that support anonymous struct members), but if you're confident that the CIS is always laid out the same in memory (and you could argue that this is implicitly required by the union special case for CIS struct members combined with the ability to get pointers to individual union members) then "hey compiler, these pointers could alias!" is _all_ you need to do that, and the C standard gives you ways to clearly indicate to the compiler that two pointers could alias. – mtraceur Nov 29 '21 at 23:06
  • Basically, if you have `struct common { int x; }; int f(struct common * p1, struct common * p2) { p1->x = 1; p2->x = 2; return p1->x; }` in one translation unit, then in any other translation unit (even without including the `struct common` definition from some header) you still have the ability to pass a pointer to any other struct with the same CIS to `f`. The standard does not explicitly define this behavior, but it seems to logically follow from what it does define that you can do it in ways necessarily safe from any possible strict-aliasing optimizations. – mtraceur Nov 29 '21 at 23:18
  • @mtraceur: If an implementation extends the semantics of the language by treating all calls between compilation units as though they synchronize the states of the abstract and physical machines--something older implementations couldn't avoid doing--then many things will be possible that aren't possible absent such extension. Unfortunately, there's no standard means of indicating when program correctness would require that implementations behave in that fashion even when link-time-optimization would give them the ability to do otherwise. – supercat Nov 29 '21 at 23:50
  • Actually I think I am wrong about it being possible to remain safe from optimizations based on strict-aliasing assumptions like that: I could see a particularly eager whole-program optimization one day deciding that `f((struct common *)&struct_of_another_type_with_CIS, ...);` in another translation unit is undefined behavior. (Edit: yeah your last comment seems to suggest you were thinking something similar.) – mtraceur Nov 29 '21 at 23:50
  • @mtraceur: The authors of the Standard expected that in cases where transitively applying parts of the Standard and an implementation's documentation would define the behavior of some action, but other parts of the Standard characterize it as UB, implementations would seek to give priority to the former whenever their customers would find that useful. There was thus no need to try to avoid classifying as UB constructs that quality implementations should obviously process usefully. Unfortunately, some compiler writers interpret the lack of requirements as inviting nonsensical behavior. – supercat Nov 29 '21 at 23:55
  • Still, it's not fair to say that the standard provides no way to indicate that two structures are intended to be used interchangeably. It *is* fair to say that it provides no way to say that all structs with a CIS should be usable interchangeably, and that it puts burden/boilerplate on the programmer that pre-standard C didn't do, by forcing us to indicate that two struct types with a CIS are intended to be used interchangeably on either a per-struct basis, or by putting them into a union. – mtraceur Nov 30 '21 at 00:01
  • Sorry there's a race condition developing between our comments. My last comment was a follow-up to my prior comment, made before seeing your comment which happened in between. – mtraceur Nov 30 '21 at 00:02
  • Anyway, I think I'm maybe being needlessly pedantic. I just didn't want a C novice to read "unfortunately, the C standard [...] provides no means for programmers to indicate when two structures should be usable interchangeably" and conclude that this is an absolute inability, with no well-defined way to do it (rather than something that can be done in a standard-conforming portable manner, just with some boilerplate and other adjustments to the code). – mtraceur Nov 30 '21 at 00:13
  • @mtraceur: Being able to write a function that will accept objects of multiple structure types interchangeably *by avoiding having it access the object using any of the types* is very different from being able to *use the structure types interchangeably*. – supercat Nov 30 '21 at 00:15
  • @mtraceur: Besides, there's no reason for programmers to jump through the hoops you describe to try to appease obtuse compiler maintainers, rather than simply using `-fno-strict-aliasing`. When that flag is used, one can exploit the "It Just Works" semantics that CIS guarantees were intended to provide, and when the flag isn't used both clang and gcc are prone to have one phase of optimization turn constructs that jump through the necessary hoops to be strictly conforming into other constructs which would yield identical machine code but have different aliasing implications... – supercat Nov 30 '21 at 00:22
  • ...in some corner cases, and then have another phase of optimization treat the corner cases in question as UB despite the fact that they were unambiguously defined in the original source code. GCC's especially bad here, sometimes converting something like `if (flag) ((T1*)p)->x = 2; else ((T2*)p)->x = 2;` into an unconditional `((T2*)p)->x = 2;` and ignore the possibility that it might affect an object of type `T1`. – supercat Nov 30 '21 at 00:24
  • Ohhhh I see the distinction. Yeah I guess that's true to a large extent. I guess to me the union-struct-members-with-CIS trick counts as enabling using structs interchangeably in some cases. If we have function `f(struct { int x; float f; } * p);` which only uses `x`, it is well-defined to assign an instance of `struct { int x; char * p;};` to the `.from` member of `union { struct { int x; float f; } to; struct { int x; char * p;} from; };` and then pass the address of the union's `.to` as the `p` argument to `f`. To me this works as being able to use the types interchangeably in some cases. – mtraceur Nov 30 '21 at 00:30
  • @mtraceur: The way clang and gcc interpret the CIS rule, the address-of operator is essentially useless with union members, since the pointer cannot be reliably used to access the member type, even in contexts where the complete union object would be visible. Indeed, given `union foo {S1 arr1[5]; S2 arr2[5];} u;`, neither compiler will recognize that an access to `(u.arr1+i)->x` might affect `(u.arr2+j)->x` when the members are accessed using pointer-arithmetic syntax rather than array-element syntax. – supercat Nov 30 '21 at 03:49
  • That's interesting. It makes sense though, because the standard only makes a special case exception for access of the CIS of union members of different struct types, and I would not expect that to apply to array-of-struct types. – mtraceur Nov 30 '21 at 04:42
  • To be clear, the union-struct-member-CIS special case I am talking about is the one in [C89 draft 3.3.2.3](https://port70.net/~nsz/c/c89/c89-draft.html#3.3.2.3): "If a union contains several structures that share a common initial sequence, and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them." To me, that wording cannot mean either `u.arr1` or `u.arr2`, but it does mean that if we have `union { struct S1 s1; struct S2 s2; } u2;`, we can assign to `u2.s1` and then access `u2.s2.x` (including through `(&u2.s2)->x`). – mtraceur Nov 30 '21 at 04:57
  • @mtraceur: If the structures were different size, things wouldn't line up, but if they are the same size the elements would be guaranteed to coincide, and the only question would be whether a compiler should be willfully blind to the fact that pointers are both derived from objects of the same union type. – supercat Nov 30 '21 at 13:36
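
Two of the cases from the comments above, written out as a sketch. `struct s1`, `struct s2` and `struct common` follow the declarations quoted in the comments; `union either`, `read_x_through_s2` and `write_both` are names invented here (the union is not called `common` because struct and union tags share one name space).

```c
/* Two struct types that share the common initial sequence { int x; }. */
struct s1 { int x; float y; };
struct s2 { int x; char *z; };

/* A union that makes the relationship visible to the compiler. */
union either { struct s1 s1; struct s2 s2; };

/* Store through one member, inspect the common part through the other.
   The union object is accessed directly, with its definition in scope. */
int read_x_through_s2(int value)
{
    union either u;
    u.s1.x = value;
    u.s1.y = 0.0f;
    return u.s2.x;
}

/* With a single tagged type, both parameters have the same type, so the
   compiler must allow for p1 and p2 pointing at the same object. */
struct common { int x; };

int write_both(struct common *p1, struct common *p2)
{
    p1->x = 1;
    p2->x = 2;
    return p1->x;
}
```

`write_both` has to reload `p1->x` after the second store, which is the behavior mtraceur reports for the tagged version with both GCC and Clang. The union case is limited in the way supercat describes: it covers access through the union object itself, not through arbitrary pointers derived from its members.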