What does a C char* array look like in memory?

Question

I'm running a CRC32 polynomial division code over a char array in ARM Assembler, and I've come to realize that part of my code is based off of certain assumptions that I shouldn't make before checking. Specifically:

The size of each char (1 byte?)
The size of the array in general

Is it fair to assume that this array will be divisible into a number of 32-bit words without remainder? Also, if I modify the array in C to contain X more chars than it should have (in order to be divisible by 32 bits), but do not set those chars, what will they look like in ASM binary? Like, say I allocate the Array as

    int buffersize = 2000;
    char buffer[buffersize+3];

And then fill exactly buffersize locations in the array; is there anything to tell what the remaining 3 chars would look like in memory? I'd assume they'd be before the first 00000000 byte, as that's the end of the array, is that correct?

Err, any way to distinguish those remaining 3 chars. I'm assuming that these chars in C are 8 bit, but I keep hearing conflicting statements about it. — J.Swersey, Jan 31 '14 at 12:53
The C specification says that `sizeof(char)` will always be `1`, but the actual size of a `char` doesn't have to be 8 bits. In the 1970s, it was still not uncommon to find systems where a `char` was 7 or 9 bits. Now it's almost impossible to find a system where `char` is not 8 bits. The size of an array will always be the number of entries multiplied by the actual size of the type used for the array. — Some programmer dude, Jan 31 '14 at 12:56
A C char array is simply a sequence of bytes in memory. No surrounding data, nothing to "mark" start or end or identify the length. — Hot Licks, Jan 31 '14 at 12:57
If you don't do a memset then they will be whatever value was in that location. — James Black, Jan 31 '14 at 12:57
As for the three remaining characters in your `buffer` array, it depends on if `buffer` is declared as a global variable or if it's a local (inside a function) variable. All global variables are initialized to zero, regardless of their size. No local variables are ever automatically initialized. — Some programmer dude, Jan 31 '14 at 12:58
The size of `char` in C is supposedly "compiler defined" and can be defined differently (but always >= 8) by a specific compiler. But practically speaking, aside from some special situations, it's always 8 bits and lots of code would break if it wasn't. — Hot Licks, Jan 31 '14 at 12:59
Okay, so it's almost certainly 8 bits. That at least means I can load the final word in my code byte for byte. @HotLicks: the end of a C char array is defined by the null byte, isn't it? At least, that's what I learned in class this semester... Joachim, that makes this a little more complicated, but I think I've found a way to do what I'm trying to do without that. Thanks! — J.Swersey, Jan 31 '14 at 13:20
@JonahStephenSwersey No, the end of a *string* is defined by the "null" byte. The end of an array is the beginning of the array plus the size of the array. Just because you can have a string inside of an array doesn't mean you should mix the two. In fact, if you have an array of `10` characters and put into it a string of only four characters plus the terminating '\0' (five characters in total), that doesn't mean the array ends in the middle. — Some programmer dude, Jan 31 '14 at 13:23
You've got to understand that arrays are something of a fiction in C. Only a few compiler features support arrays, and there's nothing at runtime to tell you how big an array is or even if a particular piece of storage *is* an array. — Hot Licks, Jan 31 '14 at 13:33
I'll keep that in mind. So assuming that my Assembler function is given a pointer to the start of the char array and a separate integer, is there any way to determine where the char array ends? Maybe something with the stack pointer? Or do I have to give the function an extra parameter to tell it where the "array" ends? — J.Swersey, Jan 31 '14 at 13:38
There's no way to tell where the array ends without someone somehow telling you. It could be defined in the protocol to be a fixed length, there could be another parameter passed to tell you the length, there could be a "terminal" value to mark the end, and probably one or two others. — Hot Licks, Jan 31 '14 at 17:11
Well, could a '\0' char appear anywhere in the array? And would any other byte look like 00000000? Because even if it's not a string, we could just attach that to the end of the array and use that for comparison, but only if it doesn't duplicate. — J.Swersey, Jan 31 '14 at 17:50
What's in the array is what's in the array. It may be that a zero value is not possible "in real life" and you can use that to mark the end. Or it may be that zero is possible but not a negative value. It all depends on what you're storing in the array. — Hot Licks, Jan 31 '14 at 17:57
Do note that another scheme for passing arrays that has not been mentioned here is "length prefix". This is done several ways, but the simplest is to just put the length in array element zero and do your indexing starting with 1 (which you've always wanted to do anyway, if you're a recycled FORTRAN guy). You do also need to remember that the actual array length needs to be one longer to account for the length value. — Hot Licks, Jan 31 '14 at 20:46
@PeterCordes Please explain why you removed `[memory]` tag and added `[arm]` tag... — autistic, Aug 12 '18 at 10:51
@autistic: because the question is asking about the in-memory layout *on ARM* for the purposes of a hand-written ARM asm function. So we can simplify it to just ARM and say something specific, without worrying about DSPs with 32-bit `char` = `int`, or old machines with `CHAR_BIT=9`. The `[assembly]` tag usually only makes sense in combination with an architecture tag. I had to get rid of something, and [tag:memory] is for questions about memory *management*, according to its tooltip. — Peter Cordes, Aug 12 '18 at 11:39

score 2 · Answer 1 · answered Aug 11 '18 at 22:49


The size of each char (1 byte?)

Yes, in C, a char always occupies exactly one byte, by definition:

character, single-byte character: bit representation that fits in a byte

A byte however, is not required to be an octet (eight bits); implementations are allowed to use larger bytes... however, sizeof (char) is always 1 and a char value always occupies one byte.
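The two facts above (one `char` is one byte, and a byte has `CHAR_BIT` bits) can be checked at compile time rather than assumed. The following is a minimal sketch, assuming a C11 compiler; the helper name `bits_in_char_array` is made up for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Refuse to compile on an implementation whose bytes are not octets. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Hypothetical helper: number of bits occupied by an array of n chars.
 * sizeof(char) is 1 by definition, so this is simply n * CHAR_BIT. */
static size_t bits_in_char_array(size_t n)
{
    return n * sizeof(char) * CHAR_BIT;
}
```

On a normal ARM toolchain the assertion should pass, and hand-written assembly that loads the data byte by byte can then rely on 8-bit bytes.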

Is it fair to assume that this array will be divisible into a number of 32-bit words without remainder?

Not if you're referring to buffer in your question, no. That occupies 2003 bytes, and 2003, being an odd number, is not divisible by 4 or any other even number.
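If the intent is a buffer whose size is a whole number of 32-bit words, adding 3 is only the first half of the usual rounding idiom; the sum also has to be rounded down to a multiple of 4. A minimal sketch, assuming 8-bit bytes (so a 32-bit word is 4 bytes) and a hypothetical helper name `round_up_to_word`:

```c
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of 4,
 * i.e. to a whole number of 32-bit words when bytes are 8 bits. */
static size_t round_up_to_word(size_t n)
{
    /* Adding 3 pushes any non-multiple past the next boundary;
     * clearing the two low bits then snaps back onto that boundary. */
    return (n + 3u) & ~(size_t)3;
}
```

With this, a payload of 2000 bytes needs no padding at all, while a payload of 2001, 2002 or 2003 bytes would be given a 2004-byte buffer.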

what will they look like in ASM binary?

Attempting to use an indeterminate value is undefined behaviour in C, so what those bytes actually contain is a question for your compiler, architecture or OS documentation rather than something the C language itself answers.

is there anything to tell what the remaining 3 chars would look like in memory?

The values, being uninitialised, are indeterminate. Attempting to use them is undefined behaviour.
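One way to sidestep the problem is to give the unused bytes a known value before anything reads them. This is a minimal sketch using `memset`; the function name `zero_tail` and its parameters are hypothetical, not part of the code in the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: set every byte from index `used` up to, but not
 * including, index `capacity` to zero. Bytes before `used` are untouched. */
static void zero_tail(char *buf, size_t used, size_t capacity)
{
    if (used < capacity)
        memset(buf + used, 0, capacity - used);
}
```

After such a call the padding bytes are determinate zeros. Bear in mind that if the checksum routine processes them, they become part of the data being checksummed, so both sides of the comparison would need to pad the same way.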

I'd assume they'd be before the first 00000000 byte, as that's the end of the array, is that correct?

If you're referring to the string terminating '\0' byte, then no, that needn't be at the end of the array. It could be anywhere in the array. What's important to remember is that it denotes the end of your string, so if you want to append to a string (i.e. like strcat does), then you need to move the string terminator...
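The difference between where a string ends and where its array ends can be seen by comparing `strlen` with `sizeof`. A small sketch, with an array name (`demo`) and accessor functions invented for illustration:

```c
#include <stddef.h>
#include <string.h>

/* A 10-byte array holding a 4-character string. The initialiser also
 * sets the remaining six bytes, including the terminator, to zero. */
static char demo[10] = "four";

/* Size of the array object in bytes: fixed by the declaration. */
static size_t demo_array_size(void)
{
    return sizeof demo;
}

/* Length of the string stored in it: counted up to the first '\0'. */
static size_t demo_string_length(void)
{
    return strlen(demo);
}
```

Here the array is 10 bytes long while the string in it is 4 characters long, with its terminator at index 4, well before the end of the array.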

It is unfortunate to see that there are still people who subscribe to the idea that C strings are arrays. The C standard doesn't say this. It defines a string as a contiguous sequence of characters terminated by and including the first null character, which may occupy only part of an array.


autistic


`char` is 1 byte on ARM. This is an ARM question, about normal C implementations on ARM (which follow the ARM ABI). – Peter Cordes Aug 12 '18 at 01:17
In ARM assembly, reading "uninitialized" memory is not UB; as long as the correct behaviour of the program doesn't depend on the value, you're fine. ARM, like x86 and other normal modern CPUs, has no trap representations for integers. See [Intriguing assembly for comparing std::optional of primitive types](https://stackoverflow.com/a/51619203) and [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) (applies equally to ARM) – Peter Cordes Aug 12 '18 at 01:19
@PeterCordes Are we talking about C code, or are we talking about ARM assembly? I'd argue that the `[arm]` tag is entirely irrelevant here. – autistic Aug 12 '18 at 10:27
It's not unreasonable to cover some generic ground in your answer that will apply to all architectures. But the question was already tagged `[assembly]` and specifically mentioned ARM. Unlike some architectures, a `char` is smaller than a register / the word-size, and there are no trap representations. These factors are highly relevant for how you handle the end-of-string / end-of-buffer issues and possibly reading outside the C object (asm doesn't have undefined behaviour the way C does). – Peter Cordes Aug 12 '18 at 11:42
@PeterCordes is `char buffer[buffersize+3];` C code, or ARM assembly? If it's C code, we must translate using a C compiler and C compilers are bound by the standards I cite from. Thus, the C standard is authoritative, and you can't say there are no trap representations (because there are trap representations, see also compiler docs for `-ftrapv`). If it's ARM assembly (it isn't, is it?) then you could refer to whichever of the umpteen ARM assembly specifications. Nonetheless, my answer to this question is entirely valid as it explains the three bytes in question have *indeterminate value*. – autistic Jun 28 '21 at 01:23
`gcc -ftrapv` generates traps on signed-overflow UB. But every possible bit-pattern for an `int` or `char` is still valid. There's no value where doing `x += 0` or `x -= x` would trap on a C implementation for ARM. C does allow implementations to have trap representations, e.g. a 1's complement machine where `~0` all-one-bits (negative zero) may trap if used in arithmetic operations, instead of being treated as zero. But C implementations for ARM use 2's complement with no padding bits in their integers, and no trap representations. – Peter Cordes Jun 28 '21 at 01:39
Other than that, I forget what nits I was picking at 3 years ago (or what the question was about), other than the fact that a C implementation for ARM will have 8-bit bytes, unless it's intentionally really weird (DeathStation 9000) >.< If you don't want to change anything, I'll just leave my comments in case anyone cares. – Peter Cordes Jun 28 '21 at 01:41
