-2

I'm trying to get a better understanding of the C standard. In particular I am interested in how pointer arithmetic might work in an implementation for an unusual machine architecture.

Suppose I have a processor with 64 bit wide registers that is connected to RAM where each address corresponds to a cell 8 bits wide. An implementation of C for this machine defines CHAR_BIT to be equal to 8. Suppose I compile and execute the following lines of code:

char *pointer = 0;
pointer = pointer + 1;

After execution, pointer is equal to 1. This gives one the impression that in general data of type char corresponds to the smallest addressable unit of memory on the machine.

Now suppose I have a processor with 12 bit wide registers that is connected to RAM where each address corresponds to a cell 4 bits wide. An implementation of C for this machine defines CHAR_BIT to be equal to 12. Suppose the same lines of code are compiled and executed for this machine. Would pointer be equal to 3?

More generally, when you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?

Dschumanji
  • The value of `pointer` does not depend on the width of `char`. – Eugene Sh. Jun 04 '18 at 19:17
  • `C` says that `sizeof char` is always one. – cleblanc Jun 04 '18 at 19:20
  • The C standard is intended to support (in addition to typical systems with 8-bit char and 8-bit addressable memory) something like an audio DSP that has 24 bit registers and a 24 bit data bus. So CHAR_BIT is 24, and each memory address is also 24 bits. The standard doesn't really support every possible oddball architecture that can be imagined. – user3386109 Jun 04 '18 at 19:21
  • @cleblanc sizeof char is required to return 1, but sizeof returns the size of the object in terms of bytes. However, bytes are not defined to be exactly 8 bits wide. They are only specified to be at least 8 bits wide. – Dschumanji Jun 04 '18 at 19:25
  • @Dschumanji `sizeof` returns the size in terms of `sizeof(char)`s – Eugene Sh. Jun 04 '18 at 19:27
  • @Dschumanji yes, exactly. `sizeof char` is always one, even on a 24bit addressable audio DSP where a character is 24bits wide. – cleblanc Jun 04 '18 at 19:27
  • @EugeneSh. The standard defines sizeof(char) to be one byte. – Dschumanji Jun 04 '18 at 19:29
  • No. It is defining `sizeof(char)` to be equal to 1. – Eugene Sh. Jun 04 '18 at 19:30
  • I'm not sure, but I suspect it would be very difficult to implement C properly on a machine where memory was addressable 4 bits at a time, where to access 8 bits, and then the next 8 bits, you'd need an address that increased by 2. – Steve Summit Jun 04 '18 at 19:32
  • @EugeneSh. That was a typo. It has been corrected. sizeof(char) is defined to return 1 byte. So saying "sizeof returns the size of the object in terms of bytes" is not wrong. – Dschumanji Jun 04 '18 at 19:33
  • But there is indeed some unclarity. [Here](http://port70.net/~nsz/c/c11/n1570.html#3.6p1) it is defining a byte. [Here](http://port70.net/~nsz/c/c11/n1570.html#6.5.3.4p2) it says `sizeof` returns the size in bytes. These two are somewhat contradictory (if we say `char` is not necessarily the same as `byte`). [Here](http://port70.net/~nsz/c/c11/n1570.html#6.2.5p3) `char` is defined similarly, yet it does not mandate `char` to be the same as `byte`. – Eugene Sh. Jun 04 '18 at 19:38
  • "After execution, pointer is equal to 1" --> C does not specify the value of pointers aside from `0` equates to a _null pointer_. – chux - Reinstate Monica Jun 04 '18 at 19:53
  • For the record 4 bit MCUs do exist. I'm not sure if any C implementation was ever made for them though, because they were so ridiculously low-end, no stack etc. But if one were to make a C compiler for it, well you might end up in this situation, because `char` must be large enough to fit the basic character set. – Lundin Jun 04 '18 at 19:56
  • As for `sizeof`, the standard says "The sizeof operator yields the size (in bytes) of its operand..." and then "When sizeof is applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1". – Lundin Jun 04 '18 at 19:58
  • @Lundin Which is mandating the size of char to be one byte? – Eugene Sh. Jun 04 '18 at 19:59
  • @EugeneSh. Indeed. But it doesn't say how many bits a byte can have. – Lundin Jun 04 '18 at 20:00
  • @user3386109 I would argue that it is not *very* clear. – Eugene Sh. Jun 04 '18 at 20:28
  • @user3386109 You need to read three different places to deduce this fact. I call that not very clear. – Eugene Sh. Jun 04 '18 at 20:37
  • @EugeneSh. I think the confusion in this question comes from the fact that people have a preconceived notion of what a "byte" is. The word "byte" in the C specification has a very specific meaning, and it doesn't mean what most people think it means. I suppose that's a bad choice of terminology by the standard's committee. – user3386109 Jun 04 '18 at 20:44
  • @EugeneSh. I agree it's annoying to have to refer to three different places, but for what it's worth, I would say that this fact -- that a byte might be bigger than 8 bits -- is one of those pieces of erudite trivia that any self-respecting C expert does know. (Entirely granted that it's likely to be confusing to the lay public.) And it's not just C; it's just an antique definition of "byte". See also the [Jargon File entry](http://www.catb.org/jargon/html/B/byte.html). – Steve Summit Jun 04 '18 at 22:15
  • The reason why C supports esoteric systems is because it was ported very early to "esoteric systems" such as the Honeywell 6000 series, with 36-bit words and 9-bit bytes that are not individually machine-addressable... – Antti Haapala -- Слава Україні Jun 05 '18 at 18:40

6 Answers

3

Would pointer be equal to 3?

Well, the standard doesn't say how pointers are implemented. It tells you what happens when you use a pointer in a specific way, but not what the value of a pointer shall be.

All we know is that adding 1 to a char pointer makes the pointer point at the next char object, wherever that is. It says nothing about the pointer's value.

So when you say that

pointer = pointer + 1;

will make the pointer equal to 1, that's not guaranteed. The standard doesn't say anything about that.

On most systems a char is 8 bits and pointers are (virtual) memory addresses referencing an 8 bit addressable memory location. On such systems incrementing a char pointer will increase the pointer value (aka the memory address) by 1. However, on unusual architectures there is no way to tell.

But if you have a system where each memory address references 4 bits and a char is 12 bits, it seems a good guess that ++pointer will increase the stored address by three.
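
To make that concrete, here is a minimal sketch in plain standard C (no assumption about the underlying address width): the distance between two char pointers into the same object is fully defined, but what the addresses look like numerically is not.

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    char buf[4];
    char *p = &buf[0];
    char *q = p + 1;              /* guaranteed: q points at buf[1] */

    /* Guaranteed by the standard: the distance is exactly one char. */
    ptrdiff_t d = q - p;
    printf("q - p = %td\n", d);   /* always prints 1 */

    /* NOT specified by the standard: what the addresses look like
       numerically. The output of %p is implementation-defined. */
    printf("p = %p  q = %p\n", (void *)p, (void *)q);
    return 0;
}

On a conventional machine the two %p values would typically differ by 1, on the hypothetical 4-bit-addressable machine perhaps by 3; the first printf must print 1 either way.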

Support Ukraine
  • Thank you for the response! I think this may be the best answer so far. So even for char *, it may not be universally true that adding 1 to it will give the next address? – Dschumanji Jun 04 '18 at 19:52
  • On **all systems** a char is one byte, because that's the definition of the term *byte* in the C standard. – Antti Haapala -- Слава Україні Jun 04 '18 at 19:52
  • @Dschumanji The address might be 3 nibbles or whatever your fictional computer uses. But 1 byte increased. – Lundin Jun 04 '18 at 19:54
  • The last paragraph of this answer is wrong. The C standard does not support systems where memory is 4-bit addressable and a char is 12 bits. – user3386109 Jun 04 '18 at 20:22
  • @user3386109 Do you have a citation for that? I had that suspicion as well, but I don't remember it being stated explicitly anywhere, and I'm not sure it's even implied in an ironclad way by the existing facts that `sizeof(char) == 1` and `CHAR_BIT >= 8`. – Steve Summit Jun 04 '18 at 21:41
  • @SteveSummit I posted [an answer](https://stackoverflow.com/a/50687920/3386109) that has the relevant sections from the specification. – user3386109 Jun 04 '18 at 21:44
  • @user3386109 Okay, thanks, and those match my memory, but they don't say that a byte must be the *smallest* addressable unit, and therefore I don't think they rule out a machine that has addressable units smaller than the smallest C object. – Steve Summit Jun 04 '18 at 21:56
  • @user3386109 - Please provide specific evidence that a C compliant implementation can't be implemented on a system where memory is 4 bit addressable. Remember that the C standard uses an abstract reference machine - not any physical machine. None of the quotes in the link you provide states that memory can't be 4 bit addressable. – Support Ukraine Jun 05 '18 at 17:50
  • @AnttiHaapala Yes, a char is 1 byte (in the C definition of a byte). Where did I state otherwise? I like to correct the answer if I wrote that. – Support Ukraine Jun 05 '18 at 17:54
  • @4386427 *"On most systems a char is 8 bit (aka 1 byte)"*, here the aka part as I read it refers to it being 8 bits... – Antti Haapala -- Слава Україні Jun 05 '18 at 18:28
0

Pointers are incremented by at least the width of the datatype they "point to", but they are not guaranteed to increment by exactly that width.

For memory alignment purposes, there are many cases where a pointer will increment to the next word-aligned address rather than by the minimum width.

So, in general, you cannot assume this pointer to be equal to 3. It very well may be 3, 4, or some larger number.

Here is an example.

struct char_three {
   char a;
   char b;
   char c;
};

struct char_three* my_pointer = 0;
my_pointer++;

/* I'd be shocked if my_pointer was now 3 */

Memory alignment is machine specific. One cannot generalize about it, except that most machines define a WORD as the amount of data that can be fetched from the bus in a single aligned access. Some machines can specify addresses that don't align with bus fetches. In such a case, selecting two bytes that span the alignment boundary may result in loading two WORDs.

Most systems don't accept WORD loads on non-aligned boundaries without complaining. This means that a bit of boilerplate assembly is needed to translate the fetch into aligned accesses at the surrounding WORD boundaries, if maximum density is desired.

Most compilers prefer speed to maximum density of data, so they align their structured data to take advantage of WORD boundaries, avoiding the extra calculations. This means that in many cases, data that is not carefully aligned might contain "holes" of bytes that are not used.

If you are interested in the details of the above summary, you can read up on Data Structure Alignment, which discusses alignment and, as a consequence, padding.
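
As a small illustration of the padding point (a sketch for an ordinary hosted implementation; the exact numbers printed are implementation-specific), note that sizeof already includes any padding, so incrementing a struct pointer always steps by sizeof the struct, holes and all:

#include <stdio.h>
#include <stddef.h>

struct mixed {
    char a;
    int  b;   /* on many machines this forces padding after 'a' */
    char c;   /* and often trailing padding after 'c' */
};

int main(void)
{
    struct mixed arr[2];

    printf("sizeof(struct mixed) = %zu\n", sizeof(struct mixed));
    printf("offset of b = %zu, offset of c = %zu\n",
           offsetof(struct mixed, b), offsetof(struct mixed, c));

    /* The byte distance between consecutive array elements is always
       exactly sizeof(struct mixed), padding included. */
    printf("byte distance = %zu\n",
           (size_t)((char *)&arr[1] - (char *)&arr[0]));
    return 0;
}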

Edwin Buck
  • Pretty much every 8 bit CPU and most 16 bit won't pad the struct, so `my_pointer++` would indeed increase the address by 3 bytes on those systems. – Lundin Jun 04 '18 at 19:28
  • Not to mention that the struct itself, were it padded, would have the padding reflected in its `sizeof`. – Christian Gibbons Jun 04 '18 at 19:29
  • @ChristianGibbons Your statement isn't always true, that's why there are Stackoverflow questions like https://stackoverflow.com/q/119123/302139 . Sure, most people don't get into the corner cases by typically requesting things that end on WORD alignment, with padding in the middle; but, if you have your padding at the end, it's an unpleasant surprise when you discover that they don't match. – Edwin Buck Jun 04 '18 at 19:37
  • @Lundin I'd readily agree on 8 bit systems, but for 16 bit systems, if the memory bus was 16 bits wide and didn't support non-aligned memory access, it seems that 3 would be an improbable answer. – Edwin Buck Jun 04 '18 at 19:39
  • @EdwinBuck That question is about why the sizeof the struct is not equal to the size of its individual members, which supports, rather than contradicts, my statement that the sizeof the struct includes the padding. – Christian Gibbons Jun 04 '18 at 19:40
  • @EdwinBuck Thank you for the response, Edwin Buck. I think your first sentence answers my question. Do you know where something like that is stated in the standard? – Dschumanji Jun 04 '18 at 19:45
  • @ChristianGibbons Perhaps I wasn't clear, but there can be additional non-struct padding in arrays of a type. It's padding after the struct, and explains why sizeof on arrays isn't the same as sizeof the struct * element items. So yes, I did support your statement, but was trying to add in an important caveat. I hope in rereading my words, this is a bit more clear. – Edwin Buck Jun 04 '18 at 19:45
  • @Dschumanji This is part of the "implementation dependent" part of the standard, so you're going to have to find out what your answer is for your platform. – Edwin Buck Jun 04 '18 at 19:46
  • The point here, however, is that pointer arithmetic isn't what picks the size of an item. In case there are alignment restrictions, the size of the example struct is 4. The padding is added to the struct itself, regardless of whether it's involved in arithmetic or not. This is to ensure that it always gets allocated at an aligned address. – Lundin Jun 04 '18 at 19:51
0

char *pointer = 0;
After execution, pointer is equal to 1

Not necessarily. This special case gives you a null pointer, since 0 is a null pointer constant. Strictly speaking, such a pointer is not supposed to point at a valid object. If you look at the actual address stored in the pointer, it could be anything.

Null pointers aside, the C language expects you to do pointer arithmetic by first pointing at an array. Or, in the case of char, you can also point at a chunk of generic data such as a struct. Everything else, like your example, is undefined behavior.
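
For reference, a short sketch of the kinds of pointer arithmetic that are well defined (the byte values printed by the second loop, and their order, are implementation-specific):

#include <stdio.h>

int main(void)
{
    /* Well defined: stepping through an array object, up to and
       including the one-past-the-end position. */
    char text[] = "abc";
    for (char *p = text; *p != '\0'; p++)
        printf("%c ", *p);
    putchar('\n');

    /* Also well defined: inspecting the bytes of any object through
       an unsigned char pointer. */
    int x = 42;
    unsigned char *bytes = (unsigned char *)&x;
    for (size_t i = 0; i < sizeof x; i++)
        printf("%02x ", (unsigned)bytes[i]);
    putchar('\n');
    return 0;
}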

An implementation of C for this machine defines CHAR_BIT to be equal to 12

The C standard defines char to be equal to a byte, so your example is a bit weird and contradictory. Pointer arithmetic will always increase the pointer to point at the next object in the array. The standard doesn't really speak about the representation of addresses at all, but in your fictional example it would be sensible for the address to advance by 12 bits, because that's the size of a char there.

Fictional computers are quite meaningless to discuss even from a learning point-of-view. I'd advise to focus on real-world computers instead.

Lundin
  • Good point. If one wants to look at some odd, real world computers. I'd recommend the 4004 (4 bit word), Cyber Mainframes (60 bit word, 6 bit byte), DEC 10 and DEC 20 (36 bit word, 9 bit byte), and more https://en.wikipedia.org/wiki/Word_(computer_architecture) Then there's the variable word length architectures (which used a stop value to end the word), like the early IBM systems. In short, there's plenty of oddities to look over without dreaming up new ones. – Edwin Buck Jun 04 '18 at 20:05
  • @Lundin I was hesitant about using 0 as an example for this exact reason. How is my example of a byte being defined as 12 bits contradictory when the standard doesn't define a byte to be exactly 8 bits? Your answer and some others are indicating that the pointer incrementing by 3 is a sensible answer, but not something that must be enforced by the standard. – Dschumanji Jun 04 '18 at 20:05
  • @Dschumanji The 8 bit byte is popular now due to IBM's wholesale adoption of that size with the IBM 360s, which were some of the first computers to treat text processing as a primary feature. ASCII fit into 7 bits, leaving 128 extra characters in an 8 bit byte for whatever they could dream they needed. – Edwin Buck Jun 04 '18 at 20:07
  • @Dschumanji You can't really use C in a sensible way on a system with 4 bit addressable units but 12 bit characters. I suppose they could go with `uint4_t` as a compromise. By increasing a character pointer by 1, you increase the address by 1 byte. How big 1 byte is on the given system, the C standard doesn't say. It only says that it must be at least 8 bits. – Lundin Jun 04 '18 at 20:08
0

Seems like the confusion in this question comes from the fact that the word "byte" in the C standard doesn't have the typical definition (which is 8 bits). Specifically, the word "byte" in the C standard means a collection of bits, where the number of bits is specified by the implementation-defined constant CHAR_BIT. Furthermore, a "byte" as defined by the C standard is the smallest addressable object that a C program can access.

This leaves open the question as to whether there is a one-to-one correspondence between the C definition of "addressable" and the hardware's definition of "addressable". In other words, is it possible that the hardware can address objects that are smaller than a "byte"? If (as in the OP) a "byte" occupies 3 addresses, then that implies that "byte" accesses have an alignment restriction, which is to say that 3 and 6 are valid "byte" addresses, but 4 and 5 are not. This is prohibited by §6.2.8, which discusses the alignment of objects.

Which means that the architecture proposed by the OP is not supported by the C specification. In particular, an implementation may not have pointers that point to 4-bit objects when CHAR_BIT is equal to 12.


Here are the relevant sections from the C standard:

§3.6 The definition of "byte" as used in the standard

[A byte is an] addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.

NOTE 1 It is possible to express the address of each individual byte of an object uniquely.

NOTE 2 A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.

§5.2.4.2.1 describes CHAR_BIT as the

number of bits for smallest object that is not a bit-field (byte)

§6.2.6.1 restricts all objects that are larger than a char to be a multiple of CHAR_BIT bits:

[...] Except for bit-fields, objects are composed of contiguous sequences of one or more bytes, the number, order, and encoding of which are either explicitly specified or implementation-defined.

[...] Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes.

§6.2.8 restricts the alignment of objects

Complete object types have alignment requirements which place restrictions on the addresses at which objects of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated.

Valid alignments include only those values returned by an _Alignof expression for fundamental types, plus an additional implementation-defined set of values, which may be empty. Every valid alignment value shall be a nonnegative integral power of two.

§6.5.3.4 specifies the sizeof a char, and hence a "byte"

When sizeof is applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1.
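
If you want to see what a particular implementation chose for these quantities, here is a small C11 sketch (the values printed are, of course, implementation-specific):

#include <stdio.h>
#include <limits.h>
#include <stddef.h>    /* max_align_t */
#include <stdalign.h>  /* alignof */

int main(void)
{
    printf("CHAR_BIT             = %d\n", CHAR_BIT);
    printf("sizeof(char)         = %zu\n", sizeof(char));  /* always 1 */
    printf("sizeof(int)          = %zu\n", sizeof(int));
    printf("alignof(int)         = %zu\n", alignof(int));
    printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
    return 0;
}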

user3386109
  • On oddball DSPs, they say that a byte is 16 or 24 bits though. And anyone writing C instead of assembler for said DSPs is likely making a big mistake... – Lundin Jun 04 '18 at 20:03
  • Okay, and now you try porting it to an 8 bit PIC :) – Lundin Jun 04 '18 at 20:10
  • I was joking, but the main point of writing in C is otherwise that you want to be able to (im)port (parts) of the code. Which I wouldn't dare to do for such an exotic device anyway. Traditionally all code for DSPs is written in assembler, since they are always so very task-specific. Yes, C compilers exist for DSPs... but the main advantage of that is that you can take a C programmer to do the job without learning the oddball DSP assembler. – Lundin Jun 04 '18 at 20:27
  • Nowhere does this say that the C byte must be the *smallest* addressable unit, though. – Steve Summit Jun 04 '18 at 21:54
  • @SteveSummit It says it right there, in plain lawyer-speak (the question *is* tagged language-lawyer). §3.6 Note 1: *"It is possible to express the address of each individual byte of an object uniquely."* So bytes must be addressable. §5.2.4.2.1: *"smallest object [...] (byte)"* And a byte is the smallest object. This is confirmed by §6.2.6.1 *"any other object type consists of [...] n [...] bytes"* So a byte is the smallest addressable object, since all other objects must be a multiple number of bytes. – user3386109 Jun 04 '18 at 22:03
  • @user3386109 I think you've proved that the byte is the smallest *object*, and the smallest "unit" that can be addressed *by a C program*. I could be wrong, but I don't think you've proved that a machine may not have units that are individually addressable and are smaller than any C object (and that are therefore impossible to describe by C objects). – Steve Summit Jun 04 '18 at 22:11
  • @SteveSummit Go back to the original question: *"a processor with 12 bit wide registers that is connected to RAM where each address corresponds to a cell 4 bits wide. An implementation of C for this machine defines CHAR_BIT to be equal to 12. Suppose the same lines of code are compiled and executed for this machine. Would pointer be equal to 3?"* What is your answer to that question? And what does the accepted answer say? – user3386109 Jun 04 '18 at 22:16
  • I think that the pointer would be equal to 3 (for suitable values of "equal"), and I therefore agree with the accepted answer. It's a bizarre result, but I don't think it contradicts the Standard, because there's no way for a portable program to know what numbers pointers are equal to. And while C for a 4-bit machine would certainly be difficult and strange, I'm not yet convinced it's ruled out by the Standard. – Steve Summit Jun 04 '18 at 22:23
  • Quote: "the smallest addressable object that a C program can access" is not the same as "the smallest addressable object that a physical machine can access" – Support Ukraine Jun 05 '18 at 18:09
  • Bytes need to be addressable, but they need not be *machine-addressable*, hence the wider `char *` on 36-bit machines - that only natively support word-addressing. – Antti Haapala -- Слава Україні Jun 05 '18 at 18:33
0

When you increment a pointer to a char, is the address equal to CHAR_BIT divided by the width of a memory cell on the machine?

On a "conventional" machine -- indeed on the vast majority of machines where C runs -- CHAR_BIT simply is the width of a memory cell on the machine, so the answer to the question is vacuously "yes" (since CHAR_BIT / CHAR_BIT is 1.).

A machine with memory cells smaller than CHAR_BIT would be very, very strange -- arguably incompatible with C's definition.

C's definition says that:

  • sizeof(char) is exactly 1.

  • CHAR_BIT, the number of bits in a char, is at least 8. That is, as far as C is concerned, a byte may not be smaller than 8 bits. (It may be larger, and this is a surprise to many people, but it does not concern us here.)

  • There is a strong suggestion (if not an explicit requirement) that char (or "byte") is the machine's "minimum addressable unit" or some such.

So for a machine that can address 4 bits at a time, we would have to pick unnatural values for sizeof(char) and CHAR_BIT (which would otherwise probably want to be 2 and 4, respectively), and we would have to ignore the suggestion that type char is the machine's minimum addressable unit.

C does not mandate the internal representation (the bit pattern) of a pointer. The closest a portable C program can get to doing anything with the internal representation of a pointer value is to print it out using %p -- and that's explicitly defined to be implementation-defined.

So I think the only way to implement C on a "4 bit" machine would involve having the code

char a[10];
char *p = a;
p++;

generate instructions which actually incremented the address behind p by 2.

It would then be an interesting question whether %p should print the actual, raw pointer value, or the value divided by 2.
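
On real machines, the closest you can get to observing that choice is to print both views of the pointer, with the caveat that each view is implementation-defined; a sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    char a[10];
    char *p = a;
    char *q = p + 1;

    /* Both of these views of a pointer are implementation-defined. */
    printf("p = %p, p + 1 = %p\n", (void *)p, (void *)q);

    /* On typical byte-addressed machines this prints 1; on the
       hypothetical machines in this question it might print 2 or 3,
       or something else entirely -- the standard doesn't say. */
    printf("numeric difference = %ju\n",
           (uintmax_t)((uintptr_t)q - (uintptr_t)p));
    return 0;
}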

It would also be lots of fun to watch the ensuing fireworks as too-clever programmers on such a machine used type punning techniques to get their hands on the internal value of pointers so that they could increment them by actually 1 -- not the 2 that "proper" additions of 1 would always generate -- such that they could amaze their friends by accessing the odd nybble of a byte, or confound the regulars on SO by asking questions about it. "I just incremented a char pointer by 1. Why is %p showing a value that's 2 greater?"

Steve Summit
  • Thank you for your answer, Steve. I don't see anything in the specification that suggests a char should be the size of the smallest addressable unit. Section 3.6 paragraph 1 just says a byte has to be an addressable unit of storage. The stipulation that sizeof(char) is 1 byte then implies that a char only has to be the size of some addressable unit of storage. In addition, that unit of storage is defined in CHAR_BIT. – Dschumanji Jun 04 '18 at 23:56
  • I made a typo. I meant to say "In addition, the size of that unit of storage is defined by CHAR_BIT". The standard seems to aim for a definition of the size of char as the basic unit of measuring the size of non bit field objects rather than the smallest addressable unit of a machine. It seems like it can accommodate the possibility of strange architectures like the hypothetical one in my question. – Dschumanji Jun 05 '18 at 00:09
  • As others have stated, there is no stipulation that adding 1 to a char pointer must increase the address by exactly 1. It might need to be incremented by 3, like in the hypothetical architecture. – Dschumanji Jun 05 '18 at 00:15
  • @Dschumanji No, there is no stipulation, but since everyone knows that `sizeof(char)` is 1 (indeed, since newbies are regularly scolded for explicitly multiplying by `sizeof(char)` "since you don't have to and it's unnecessary and confusing"), a lot of people's heads would explode if adding 1 to a `char` pointer did not add 1, believe me. :-) – Steve Summit Jun 05 '18 at 00:25
0

The following code fragment demonstrates an invariant of C pointer arithmetic -- no matter what CHAR_BIT is, no matter what the hardware's least addressable unit is, and no matter what the actual bit representation of pointers is:

#include <assert.h>
typedef double T; // stand-in definition so the fragment compiles
int main(void)
{
    T x[2]; // for any object type T whatsoever
    assert(&x[1] - &x[0] == 1); // must be true
}

And since sizeof(char) == 1 by definition, this also means that

#include <assert.h>
typedef double T; // stand-in definition so the fragment compiles
int main(void)
{
    T x[2]; // again for any object type T whatsoever
    char *p = (char *)&x[0];
    char *q = (char *)&x[1];
    assert(q - p == sizeof(T)); // must be true
}

However, if you convert to integers before performing the subtraction, the invariant evaporates:

#include <assert.h>
#include <inttypes.h>
typedef double T; // stand-in definition so the fragment compiles
int main(void)
{
    T x[2];
    uintptr_t p = (uintptr_t)&x[0];
    uintptr_t q = (uintptr_t)&x[1];
    assert(q - p == sizeof(T)); // implementation-defined whether true
}

because the transformation performed by converting a pointer to an integer of the same size, or vice versa, is implementation-defined. I think it's required to be bijective, but I could be wrong about that, and it is definitely not required to preserve any of the above invariants.
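
For completeness, the one thing the standard does promise about these conversions (C11 7.20.1.4) is the round trip through uintptr_t, when the implementation provides uintptr_t at all, since the type is optional; a sketch:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    int x;
    void *original = &x;

    /* Implementation-defined mapping from pointer to integer... */
    uintptr_t n = (uintptr_t)original;

    /* ...but converting back must yield a pointer that compares
       equal to the original. */
    void *roundtrip = (void *)n;
    assert(roundtrip == original);
    return 0;
}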

zwol